Predicting flight delay probability and delay time

contributed by Danyao Jin, Shannon Ma, Xinyi Li and Zifan Wang

We used data on 5.7 million flights from January to December 2017, collected from the BTS website. The data are made available in monthly files, with the option to select the fields for download. The specific data set used in this analysis can be found here.

Overview and motivation:

Flight delays remain a major problem for travelers worldwide. A flight is usually considered delayed if it arrives or departs more than 15 minutes behind schedule. According to the U.S. Department of Transportation (DOT) Bureau of Transportation Statistics (BTS), 18.15% of all scheduled flights in the U.S. were delayed in 2017. The main causes of delay include late arrival of the previous flight, National Aviation System delays (congested airports and air traffic jams), and airline-related reasons. 1

As a traveler, what can I do to avoid a flight delay or minimize delay time? Which airline performs best? Does travelling on a busy Monday increase my chance of a delay? Which time of day might reduce the delay time? Do big airports like John F. Kennedy International Airport have the highest delay rates? Do the main causes of delay vary across airlines and airports?

In this project, we used 2017 U.S. domestic flight data from the Bureau of Transportation Statistics to explore the factors influencing flight delay rate and delay time, build regression models for delay prediction, and offer advice on choosing the best plan for a given route.

Initial questions:

Which factors influence the probability of flight delay and the delay time? The factors we considered important and available from the database were airline, day of week, time of day, airport, region and route. To what extent do they influence the probability of flight delay and the delay time? During the process, we noticed that the impact of day of week and time of day might vary by airline, so we also conducted analyses stratified by airline.

Part 0: Description of the data

The U.S. Department of Transportation (DOT) Bureau of Transportation Statistics (BTS) collects detailed information on carrier on-time performance, including flight details, delay time, and delay reason for each flight. We downloaded the dataset from their website 3. We restricted our analysis to the latest year (2017) because a larger dataset could not be processed on our computers.

library(tidyverse)
library(dplyr)
library(dslabs)
library(readr)
library(ggthemes)
library(RColorBrewer)
library(shiny)
library(plotly)
library(splitstackshape)
library(lsmeans)
library(rsconnect)
#read in dataset "flight2017.csv"
dat <- read_csv("C:/Users/jindanyao/Desktop/2018fall/2018fall/BST260/final project/database/flight2017.csv") 

Data cleaning (checking distributions and missing values)

# check missing values:
colSums(is.na(dat))
##                  X1                YEAR             QUARTER 
##                   0                   0                   0 
##               MONTH        DAY_OF_MONTH         DAY_OF_WEEK 
##                   0                   0                   0 
##             FL_DATE   OP_UNIQUE_CARRIER   OP_CARRIER_FL_NUM 
##                   0                   0                   0 
##              ORIGIN    ORIGIN_CITY_NAME    ORIGIN_STATE_ABR 
##                   0                   0                   0 
##                DEST      DEST_CITY_NAME      DEST_STATE_ABR 
##                   0                   0                   0 
##        CRS_DEP_TIME            DEP_TIME       DEP_DELAY_NEW 
##                   0               80308               80343 
##           DEP_DEL15     DEP_DELAY_GROUP        CRS_ARR_TIME 
##               80343               80343                   0 
##            ARR_TIME       ARR_DELAY_NEW           ARR_DEL15 
##               84674               95211               95211 
##     ARR_DELAY_GROUP           CANCELLED   CANCELLATION_CODE 
##               95211                   0             5591928 
##            DISTANCE       CARRIER_DELAY       WEATHER_DELAY 
##                   0             4645148             4645148 
##           NAS_DELAY      SECURITY_DELAY LATE_AIRCRAFT_DELAY 
##             4645148             4645148             4645148 
##                   X 
##             5674621

There are no missing values for the year, quarter, month, day of month, day of week, date, or origin and destination city/state of the flights. There are 80,343 missing values for departure delay time, the main outcome of interest in our study. We assume that these missing values are due to flight cancellation. For this study, we look at the flights with non-missing departure delay times (missing values are automatically excluded from the plots and regression models).
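The assumption that missing departure delays come from cancellations can be checked with a cross-tabulation. The sketch below uses a small made-up table in place of `dat`; only the column names `CANCELLED` and `DEP_DELAY_NEW` come from the dataset, and the object names `flights`, `missing_by_cancel` and `flights_obs` are illustrative.

```r
# Toy stand-in for the flight table (values are made up for illustration).
flights <- data.frame(
  CANCELLED     = c(0, 0, 1, 1, 0),
  DEP_DELAY_NEW = c(0, 22, NA, NA, 5)
)

# Cross-tabulate cancellation status against a missing departure delay.
missing_by_cancel <- table(
  cancelled   = flights$CANCELLED,
  dep_missing = is.na(flights$DEP_DELAY_NEW)
)
print(missing_by_cancel)

# Keep only the flights with an observed departure delay.
flights_obs <- flights[!is.na(flights$DEP_DELAY_NEW), ]
```

If the assumption holds on the real data, nearly all of the missing delays should fall in the `cancelled = 1` row.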

We also look at the distributions of the relevant variables:

summary(dat)
##        X1               YEAR         QUARTER          MONTH       
##  Min.   :      1   Min.   :2017   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:1418656   1st Qu.:2017   1st Qu.:2.000   1st Qu.: 4.000  
##  Median :2837311   Median :2017   Median :3.000   Median : 7.000  
##  Mean   :2837311   Mean   :2017   Mean   :2.516   Mean   : 6.546  
##  3rd Qu.:4255966   3rd Qu.:2017   3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :5674621   Max.   :2017   Max.   :4.000   Max.   :12.000  
##                                                                   
##   DAY_OF_MONTH    DAY_OF_WEEK      FL_DATE           OP_UNIQUE_CARRIER 
##  Min.   : 1.00   Min.   :1.00   Min.   :2017-01-01   Length:5674621    
##  1st Qu.: 8.00   1st Qu.:2.00   1st Qu.:2017-04-05   Class :character  
##  Median :16.00   Median :4.00   Median :2017-07-03   Mode  :character  
##  Mean   :15.76   Mean   :3.94   Mean   :2017-07-02                     
##  3rd Qu.:23.00   3rd Qu.:6.00   3rd Qu.:2017-09-29                     
##  Max.   :31.00   Max.   :7.00   Max.   :2017-12-31                     
##                                                                        
##  OP_CARRIER_FL_NUM    ORIGIN          ORIGIN_CITY_NAME  
##  Min.   :   1      Length:5674621     Length:5674621    
##  1st Qu.: 736      Class :character   Class :character  
##  Median :1679      Mode  :character   Mode  :character  
##  Mean   :2143                                           
##  3rd Qu.:3064                                           
##  Max.   :8402                                           
##                                                         
##  ORIGIN_STATE_ABR       DEST           DEST_CITY_NAME    
##  Length:5674621     Length:5674621     Length:5674621    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  DEST_STATE_ABR      CRS_DEP_TIME     DEP_TIME     DEP_DELAY_NEW    
##  Length:5674621     Min.   :   1   Min.   :   1    Min.   :   0.00  
##  Class :character   1st Qu.: 912   1st Qu.: 914    1st Qu.:   0.00  
##  Mode  :character   Median :1323   Median :1327    Median :   0.00  
##                     Mean   :1330   Mean   :1334    Mean   :  12.83  
##                     3rd Qu.:1735   3rd Qu.:1743    3rd Qu.:   6.00  
##                     Max.   :2359   Max.   :2400    Max.   :2755.00  
##                                    NA's   :80308   NA's   :80343    
##    DEP_DEL15     DEP_DELAY_GROUP  CRS_ARR_TIME     ARR_TIME    
##  Min.   :0.00    Min.   :-2.00   Min.   :   1   Min.   :   1   
##  1st Qu.:0.00    1st Qu.:-1.00   1st Qu.:1103   1st Qu.:1050   
##  Median :0.00    Median :-1.00   Median :1520   Median :1510   
##  Mean   :0.18    Mean   : 0.03   Mean   :1489   Mean   :1469   
##  3rd Qu.:0.00    3rd Qu.: 0.00   3rd Qu.:1920   3rd Qu.:1918   
##  Max.   :1.00    Max.   :12.00   Max.   :2359   Max.   :2400   
##  NA's   :80343   NA's   :80343                  NA's   :84674  
##  ARR_DELAY_NEW       ARR_DEL15     ARR_DELAY_GROUP   CANCELLED      
##  Min.   :   0.00   Min.   :0.00    Min.   :-2.00   Min.   :0.00000  
##  1st Qu.:   0.00   1st Qu.:0.00    1st Qu.:-1.00   1st Qu.:0.00000  
##  Median :   0.00   Median :0.00    Median :-1.00   Median :0.00000  
##  Mean   :  12.84   Mean   :0.18    Mean   :-0.23   Mean   :0.01457  
##  3rd Qu.:   7.00   3rd Qu.:0.00    3rd Qu.: 0.00   3rd Qu.:0.00000  
##  Max.   :2189.00   Max.   :1.00    Max.   :12.00   Max.   :1.00000  
##  NA's   :95211     NA's   :95211   NA's   :95211                    
##  CANCELLATION_CODE     DISTANCE      CARRIER_DELAY     WEATHER_DELAY    
##  Length:5674621     Min.   :  31.0   Min.   :   0      Min.   :   0     
##  Class :character   1st Qu.: 391.0   1st Qu.:   0      1st Qu.:   0     
##  Mode  :character   Median : 680.0   Median :   1      Median :   0     
##                     Mean   : 856.7   Mean   :  20      Mean   :   3     
##                     3rd Qu.:1097.0   3rd Qu.:  17      3rd Qu.:   0     
##                     Max.   :4983.0   Max.   :1934      Max.   :1934     
##                                      NA's   :4645148   NA's   :4645148  
##    NAS_DELAY       SECURITY_DELAY    LATE_AIRCRAFT_DELAY
##  Min.   :   0      Min.   :  0       Min.   :   0       
##  1st Qu.:   0      1st Qu.:  0       1st Qu.:   0       
##  Median :   2      Median :  0       Median :   4       
##  Mean   :  16      Mean   :  0       Mean   :  25       
##  3rd Qu.:  19      3rd Qu.:  0       3rd Qu.:  31       
##  Max.   :1605      Max.   :827       Max.   :1756       
##  NA's   :4645148   NA's   :4645148   NA's   :4645148    
##       X            
##  Length:5674621    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

The minimum departure delay time is 0 (in this dataset, all early departures are set to 0), and the maximum departure delay time is 2755 minutes (about 46 hours). After checking TripAdvisor and The Ten Worst Flight Delays In History, we believe that this maximum could be genuine, so we do not exclude it. To limit the influence of potential extreme values, we fit logistic regressions as well as linear regressions in our data analysis.
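To illustrate why a binary outcome is less sensitive to extreme values, the following sketch compares the mean delay and the share of flights delayed by 15+ minutes with and without one extreme value. The delay values are made up, apart from the 2755-minute maximum quoted above.

```r
# Made-up delays in minutes, plus the same vector with one extreme value added.
delays       <- c(0, 0, 5, 20, 45)
with_outlier <- c(delays, 2755)

# The mean delay (the linear-model outcome) is pulled far upward by the outlier.
mean_plain   <- mean(delays)
mean_outlier <- mean(with_outlier)

# The 15+ minute indicator (the logistic-model outcome) counts the outlier
# as just one more delayed flight.
share_plain   <- mean(delays >= 15)
share_outlier <- mean(with_outlier >= 15)
```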

The shortest flight distance is 31 miles and the longest is 4983 miles, which are also reasonable (compare the shortest US flight route, from Barnstable Municipal Airport on Cape Cod to Nantucket Memorial Airport, and the roughly 4000-mile distance from New York to Hawaii).

month_freq <- table(dat$MONTH)
day_freq <- table(dat$DAY_OF_WEEK)
carrier_freq <- table(dat$OP_UNIQUE_CARRIER)
state <- table(dat$ORIGIN_STATE_ABR)

month_freq <- as.data.frame(month_freq)
day_freq <- as.data.frame(day_freq)
carrier_freq <- as.table(carrier_freq)
state <- as.table(state)

month_freq
##    Var1   Freq
## 1     1 450017
## 2     2 410517
## 3     3 488597
## 4     4 468329
## 5     5 486483
## 6     6 494266
## 7     7 509070
## 8     8 510451
## 9     9 458727
## 10   10 479797
## 11   11 454162
## 12   12 464205
day_freq
##   Var1   Freq
## 1    1 839772
## 2    2 819499
## 3    3 830854
## 4    4 841765
## 5    5 846443
## 6    6 689412
## 7    7 806876
carrier_freq
## 
##      AA      AS      B6      DL      EV      F9      HA      NK      OO 
##  896348  185068  298654  923560  339541  103027   80172  156818  706527 
##      UA      VX      WN 
##  584481   70981 1329444
state
## 
##     AK     AL     AR     AZ     CA     CO     CT     FL     GA     HI 
##  36396  22285  13871 172947 756448 244856  22211 457489 377288 104697 
##     IA     ID     IL     IN     KS     KY     LA     MA     MD     ME 
##  13961  22513 362857  40048   9876  34653  66302 127175 101015   6790 
##     MI     MN     MO     MS     MT     NC     ND     NE     NH     NJ 
## 158271 142632 105635   9569  17430 162891  10961  22870   6180 121631 
##     NM     NV     NY     OH     OK     OR     PA     PR     RI     SC 
##  21609 168151 245371  71211  29441  74109 112754  26897  13755  30447 
##     SD     TN     TT     TX     UT     VA     VI     VT     WA     WI 
##   8332  81934    484 557534 115337 143040   5254   3274 153388  50878 
##     WV     WY 
##   2124   7549

The frequencies of month, day of week, carrier, and departure state of the flights are all in reasonable range as well.

The dataset is fairly clean, with few missing or extreme values. In this project, we excluded all cancelled and diverted flights. A flight counts as delayed if it is delayed by 15 minutes or more.
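As a minimal sketch of the 15-minute definition, the indicator can be derived from the delay in minutes as shown below. The dataset already ships this indicator as `DEP_DEL15`; the vector names `delay_minutes` and `delayed_15` and the values are illustrative.

```r
# Made-up departure delays in minutes; NA stands for a flight without a recorded delay.
delay_minutes <- c(0, 14, 15, 47, NA)

# 1 if the flight left 15 or more minutes late, 0 otherwise, NA when unknown.
delayed_15 <- as.integer(delay_minutes >= 15)

# Share of delayed flights among those with a recorded delay.
delay_share <- mean(delayed_15, na.rm = TRUE)
```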

Part 1: Exploratory analysis

Influence of airlines:

First, we looked at the delay percentage across the 12 airlines in the U.S.:


The percentage of delayed flights for most airlines is in the range of 15% to 25%. JetBlue has the highest delay rate at 27%, and Hawaiian Airlines has the lowest at only 8.4%.
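The code behind this comparison is not shown in the report. A minimal sketch of one way to compute a delay rate per carrier follows, using base R and made-up rows; only the column names `OP_UNIQUE_CARRIER` and `DEP_DEL15` come from the dataset.

```r
# Toy data: two carriers with made-up delay indicators.
flights <- data.frame(
  OP_UNIQUE_CARRIER = c("B6", "B6", "B6", "B6", "HA", "HA", "HA", "HA"),
  DEP_DEL15         = c(1, 0, 1, NA, 0, 0, 0, 1)
)

# Drop flights without a delay indicator, then average the indicator per carrier.
observed   <- flights[!is.na(flights$DEP_DEL15), ]
delay_rate <- aggregate(DEP_DEL15 ~ OP_UNIQUE_CARRIER, data = observed, FUN = mean)
names(delay_rate)[2] <- "delay_rate"

# Sort from the highest to the lowest delay rate.
delay_rate <- delay_rate[order(-delay_rate$delay_rate), ]
print(delay_rate)
```

The same summary could be written with `group_by()` and `summarize()` from dplyr, which the rest of the report uses.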

We continue to investigate the average time of delays (among delayed flights) and reasons.

Among delayed flights, the average wait is 60-80 minutes for most airlines. ExpressJet and SkyWest tend to have the longest waits, more than 80 minutes. Hawaiian Airlines and Southwest Airlines tend to have the shortest, around 50 minutes.

For the delay reasons, carrier delay and late aircraft arrival are the two main causes of delay for most airlines. Notably, delays due to weather are not as frequent as we expected. (Carrier: delay due to carrier reasons, such as aircraft cleaning, fueling, maintenance, awaiting the arrival of connecting passengers and baggage. Late: delay due to the late arrival of the same aircraft at the previous airport. NAS: delay due to the National Airspace System, such as non-extreme weather conditions, heavy traffic volume and air traffic control. Security: delay caused by evacuation of a terminal or concourse. Weather: delay caused by extreme weather.)
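One possible way to compare the causes is by their share of total delay minutes, sketched below with made-up values. The five column names are the cause fields from the dataset; measuring importance by minutes rather than by number of affected flights is an assumption of this sketch, not necessarily what the report's plot shows.

```r
# Toy table of delayed flights with made-up minutes attributed to each cause.
delayed <- data.frame(
  CARRIER_DELAY       = c(30, 0, 10),
  WEATHER_DELAY       = c(0, 0, 5),
  NAS_DELAY           = c(0, 20, 0),
  SECURITY_DELAY      = c(0, 0, 0),
  LATE_AIRCRAFT_DELAY = c(0, 25, 10)
)

# Total minutes per cause, then each cause's percentage of all delay minutes.
cause_minutes <- colSums(delayed, na.rm = TRUE)
cause_share   <- round(100 * cause_minutes / sum(cause_minutes), 1)
print(sort(cause_share, decreasing = TRUE))
```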

Influence of departure time:

What time of the day are you most likely to be delayed?

dat0 <- dat %>% select(CRS_DEP_TIME, OP_UNIQUE_CARRIER, DEP_DEL15, DAY_OF_WEEK, DEP_DELAY_NEW)

Generate departure hour:

dat1 <- dat0 %>% mutate(DEP_HOUR=as.integer(as.numeric(CRS_DEP_TIME)/100))
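`CRS_DEP_TIME` stores the scheduled departure as an HHMM number, so dividing by 100 and dropping the remainder gives the hour. A small sketch with illustrative values, using integer division as an equivalent of the `as.integer(... / 100)` call above:

```r
# Scheduled departure times in HHMM form (00:05, 00:59, 01:00, 13:23, 23:59).
crs_dep_time <- c(5, 59, 100, 1323, 2359)

# Integer division by 100 keeps the hour and discards the minutes.
dep_hour <- crs_dep_time %/% 100
print(dep_hour)
```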

Generate the overall delay percentage and overall average delay time, plus the delay percentage and average delay time for each carrier (dat2 and dat4 for percentages; dat3, dat5 and dat6 for average delay time in minutes, with only delayed flights included):

dat2<- dat1 %>%  filter(!is.na(DEP_DEL15)) %>%
  group_by(DEP_HOUR, DAY_OF_WEEK) %>%
  mutate(PERCENTAGE_OVERALL=mean(DEP_DEL15))
dat3<- dat1 %>% filter(!is.na(DEP_DELAY_NEW) & DEP_DEL15==1) %>%
  group_by(DEP_HOUR) %>%
  mutate(AVERAGE_OVERALL=mean(DEP_DELAY_NEW))

dat4 <- dat2 %>%  filter(!is.na(DEP_DEL15)) %>%
  group_by(OP_UNIQUE_CARRIER, DAY_OF_WEEK, DEP_HOUR) %>%
  mutate(PERCENTAGE=mean(DEP_DEL15))
dat5 <- dat3 %>% filter(!is.na(DEP_DELAY_NEW) & DEP_DEL15==1) %>%
  group_by(OP_UNIQUE_CARRIER,DEP_HOUR) %>%
  mutate(AVERAGE=mean(DEP_DELAY_NEW))

dat6 <- dat1 %>% filter(!is.na(DEP_DELAY_NEW) & DEP_DEL15==1) %>%
  group_by(DAY_OF_WEEK,DEP_HOUR) %>%
  mutate(AVERAGE_NEW=mean(DEP_DELAY_NEW))
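Because `mutate()` keeps one row per flight, each heatmap cell below is drawn once for every flight in it. A possible alternative, sketched here in base R on made-up rows, is to collapse the data to one row per hour and day before plotting; the name `cell_rate` is illustrative.

```r
# Toy flights: departure hour, day of week, and delay indicator (made-up values).
flights <- data.frame(
  DEP_HOUR    = c(6, 6, 6, 17, 17, 17),
  DAY_OF_WEEK = c(1, 1, 2, 1, 1, 1),
  DEP_DEL15   = c(0, 1, 0, 1, 1, 0)
)

# One row per (hour, day) cell holding the share of delayed flights.
cell_rate <- aggregate(DEP_DEL15 ~ DEP_HOUR + DAY_OF_WEEK, data = flights, FUN = mean)
names(cell_rate)[3] <- "PERCENTAGE_OVERALL"
print(cell_rate)
```

The resulting table has the same plotted values as `dat2` but far fewer rows, which should make the tile plot lighter to render.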

Heatmap for overall delay percentage:

dat2 %>% ggplot(aes(DEP_HOUR, DAY_OF_WEEK,  fill = PERCENTAGE_OVERALL)) +
  geom_tile(color = "white") +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"),limits=c(0,1)) +
  scale_y_continuous(breaks=seq(1,7))+
  theme_minimal() +  
  theme(panel.grid = element_blank()) +
  labs(title="Overall delay percentage in each hour and each day of week",y="Day of week",x="Departure hour")

Overall, 2-4am is the time period with the highest delay rate (about 25%). In contrast, 5-10am has the lowest delay rate (about 10%). Departing in the morning might minimize your probability of a flight delay. The delay rate is higher on Thursday, Friday, Sunday and Monday than on other days.

Heatmap for overall delay time:

dat6 %>% ggplot(aes(DEP_HOUR, DAY_OF_WEEK,  fill = AVERAGE_NEW)) +
  geom_tile(color = "white") +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds")) +
  scale_y_continuous(breaks=seq(1,7))+
  theme_minimal() +  
  theme(panel.grid = element_blank()) +
  labs(title="Overall delay time (minutes) in each hour and each day of week",y="Day of week",x="Departure hour")

Delay time is highest at 1-6am and on Friday, Sunday and Monday.

Bar plot for overall average delay time (minutes) by departure hour:

dat3 %>% select(DEP_HOUR,AVERAGE_OVERALL) %>%
  unique() %>%
  ggplot(aes(DEP_HOUR,AVERAGE_OVERALL))+
  geom_bar(stat="identity", fill="#720017")+
  labs(title="Overall average delay time (minutes) in each departure hour",y="Average delay time",x="Departure hour")

On average, 1-2am and 5-8am have relatively long delays. According to the plot, the worst choice might be a flight scheduled at 5am, which may lead to an average delay of nearly 90 minutes.

We also designed an application to show your delay probability and delay time on average when you choose your airline and departure hour.

From the plot above, we can see that the effect of departure hour and day of week on delay rate does not vary across airlines. In general, the delay rate is higher on Thursday, Friday, Sunday and Monday than on other days. The delay rate is especially high from 0 to 4am, while 5 to 10am seems to be the safest period for avoiding a flight delay. Overall, Hawaiian Airlines performs best in terms of delay rate, while JetBlue Airlines performs worst among all 12 carriers in the database. Interestingly, if you take a Spirit Airlines flight leaving at 3am on Saturday, a United Airlines flight leaving at 4am on Thursday or Friday, or a SkyWest Airlines flight leaving at 0am from Thursday to Saturday, you will be delayed almost 100% of the time.

The effect of departure hour on average delay time seems to differ across airlines. SkyWest Airlines has strikingly long delays: if your SkyWest flight is scheduled to depart at 0am, you will often wait more than 1.5 hours. Hawaiian Airlines has the best performance on delay time, as most of its delays are under 1 hour.

Influence of seasons:

# Plotly credentials: set your own username and API key here (do not publish real keys)
Sys.setenv("plotly_username"="YOUR_PLOTLY_USERNAME")
Sys.setenv("plotly_api_key"="YOUR_PLOTLY_API_KEY")
# calculate mean departure delay minutes by state and by Seasons
state_delay_spring <- dat %>%  filter(DEP_DEL15==1) %>% filter(MONTH %in% c(3,4,5))%>%
  group_by(ORIGIN_STATE_ABR) %>%
  summarize(mean_delay = mean(DEP_DELAY_NEW, na.rm = TRUE))

state_delay_summer <- dat %>% filter(DEP_DEL15==1) %>% filter(MONTH %in% c(6,7,8))%>%
  group_by(ORIGIN_STATE_ABR) %>%
  summarize(mean_delay = mean(DEP_DELAY_NEW, na.rm = TRUE))

state_delay_autumn <- dat %>% filter(DEP_DEL15==1) %>% filter(MONTH %in% c(9,10,11))%>%
  group_by(ORIGIN_STATE_ABR) %>%
  summarize(mean_delay = mean(DEP_DELAY_NEW, na.rm = TRUE))

state_delay_winter <- dat %>% filter(DEP_DEL15==1) %>% filter(MONTH %in% c(12,1,2))%>%
  group_by(ORIGIN_STATE_ABR) %>%
  summarize(mean_delay = mean(DEP_DELAY_NEW, na.rm = TRUE))
# give state boundaries white borders
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

# make the plot
p_spring <- plot_geo(state_delay_spring, locationmode = 'USA-states') %>%
  add_trace(
    z = ~mean_delay, locations = ~ORIGIN_STATE_ABR,
    color = ~mean_delay, colors = 'Reds'
  ) %>%
  colorbar(title = "Departure delay(min) in spring") %>%
  layout(
    title = '2017 average departure delay (minutes) by states in Spring',
    geo = g
  )

p_summer <- plot_geo(state_delay_summer, locationmode = 'USA-states') %>%
  add_trace(
    z = ~mean_delay, locations = ~ORIGIN_STATE_ABR,
    color = ~mean_delay, colors = 'Reds'
  ) %>%
  colorbar(title = "Departure delay(min) in summer") %>%
  layout(
    title = '2017 average departure delay (minutes) by states in Summer',
    geo = g
  )

p_autumn <- plot_geo(state_delay_autumn, locationmode = 'USA-states') %>%
  add_trace(
    z = ~mean_delay, locations = ~ORIGIN_STATE_ABR,
    color = ~mean_delay, colors = 'Reds'
  ) %>%
  colorbar(title = "Departure delay(min) in autumn") %>%
  layout(
    title = '2017 average departure delay (minutes) by states in Autumn',
    geo = g
  )

p_winter <- plot_geo(state_delay_winter, locationmode = 'USA-states') %>%
  add_trace(
    z = ~mean_delay, locations = ~ORIGIN_STATE_ABR,
    color = ~mean_delay, colors = 'Reds'
  ) %>%
  colorbar(title = "Departure delay(min) in winter") %>%
  layout(
    title = '2017 average departure delay (minutes) by states in Winter',
    geo = g
  )
p_season <- subplot(p_spring, p_summer, p_autumn, p_winter, nrows = 2) %>%
  layout(title = "2017 average departure delay (minutes) by seasons",
         xaxis = list(domain=list(x=c(0,0.5),y=c(0,0.5))),
         scene = list(domain=list(x=c(0.5,1),y=c(0,0.5))),
         xaxis2 = list(domain=list(x=c(0.5,1),y=c(0.5,1))),
         annotations = list(
 list(x = 0.2 , y = 1, text = "spring", showarrow = F, xref='paper', yref='paper'),
  list(x = 0.8 , y = 1, text = "summer", showarrow = F, xref='paper', yref='paper'),
 list(x = 0.2 , y = 0.5, text = "autumn", showarrow = F, xref='paper', yref='paper'),
  list(x = 0.8 , y = 0.5, text = "winter", showarrow = F, xref='paper', yref='paper'))
         )
p_season

Seasonal changes can also affect flight delays, so we wanted an overall view of how delay times are distributed across regions in the four seasons. We can see that the states with the longest delay times vary across seasons. Delays are shorter in autumn and longer in spring and summer. You can hover over each state to check its average delay time in each season.
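The four seasonal summaries above repeat the same filter with different months. As a sketch of a more compact alternative, a month-to-season lookup can label every flight once and a single grouped summary then covers all seasons. The season boundaries match the code above; the data values and the names `season_of`, `delayed` and `season_delay` are made up for illustration.

```r
# Lookup indexed by month number: Dec-Feb winter, Mar-May spring,
# Jun-Aug summer, Sep-Nov autumn (the same grouping as in the report).
season_of <- c("winter", "winter", "spring", "spring", "spring", "summer",
               "summer", "summer", "autumn", "autumn", "autumn", "winter")

# Toy delayed flights with made-up delay minutes.
delayed <- data.frame(
  MONTH            = c(1, 4, 7, 10, 12, 3),
  ORIGIN_STATE_ABR = c("MA", "MA", "MA", "CA", "CA", "CA"),
  DEP_DELAY_NEW    = c(40, 60, 80, 30, 50, 20)
)
delayed$SEASON <- season_of[delayed$MONTH]

# Mean delay per state and season in one step.
season_delay <- aggregate(DEP_DELAY_NEW ~ ORIGIN_STATE_ABR + SEASON,
                          data = delayed, FUN = mean)
print(season_delay)
```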

Influence of airports:

Which airports tend to experience more delays? Location is another important factor that can affect flight delay. We are curious whether flight delays differ substantially across airports, whether because of weather or heavy traffic volume. Hence, we chose the top 10 busiest airports in the US for analysis 4.

Among the top 10 busiest airports in the US, the percentage of delayed flights is in the range of 15% to 25%. Newark Liberty International Airport has the highest delay rate at 25%. George Bush Intercontinental Airport and Washington Dulles International Airport have the lowest delay rates at 15%.

If a flight is delayed, you have to wait 60-80 minutes at most airports. Although Washington Dulles International Airport has the lowest percentage of delayed flights, it has the longest average delay, more than 90 minutes. John F. Kennedy International Airport ranks second, with an average delay of 88 minutes. Los Angeles International Airport has the shortest average wait, 66 minutes.

Influence of geographical factors:

Here, we look in more detail at the effect of geographical factors on flight delays. We use interactive maps to describe the delay patterns in different US states and cities, and along different flight routes.

In this step, we will describe the average delay times of each state:

# calculate mean departure delay minutes by state
state_delay <- dat %>% filter(DEP_DEL15==1) %>%
  group_by(ORIGIN_STATE_ABR) %>%
  summarize(mean_delay = mean(DEP_DELAY_NEW, na.rm = TRUE))
# give state boundaries white borders
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

# make the plot
p_state <- plot_geo(state_delay, locationmode = 'USA-states') %>%
  add_trace(
    z = ~mean_delay, locations = ~ORIGIN_STATE_ABR,
    color = ~mean_delay, colors = 'Purples'
  ) %>%
  colorbar(title = "Departure delay in minutes") %>%
  layout(
    title = '2017 average departure delay (minutes) by states',
    geo = g
  )
p_state

From the plot, we see that in general, the Northeast region of the US experienced longer delay times in 2017 (states like Maine and Vermont had average delay times over 20 minutes). In other regions, delay times seem relatively long in the South and on the West Coast.

We then look at the delay time patterns for each departure city. The delay times are grouped into quartiles shown by colored bubbles, and the size of each bubble depicts the length of the delay:

# calculate mean departure delay minutes by city
city_delay <- dat %>% filter(DEP_DEL15==1) %>%
  group_by(ORIGIN_CITY_NAME) %>%
  summarize(mean_delay = mean(DEP_DELAY_NEW, na.rm = TRUE))

city_delay <- cSplit(city_delay, "ORIGIN_CITY_NAME", sep=",")

city_delay <- city_delay %>% mutate(name = ORIGIN_CITY_NAME_1)
# add the coordinates of cities
coordinate <- read.csv('https://raw.githubusercontent.com/plotly/datasets/master/2014_us_cities.csv')


city_delay <- city_delay %>% mutate(name = trimws(as.character(name)))

coordinate <- coordinate %>% mutate(name = trimws(as.character(name)))

merged_city_delay <- left_join(city_delay,coordinate, by='name')

merged_city_delay <- merged_city_delay %>% 
  group_by(name) %>%
  summarize(mean_delay = mean(mean_delay, na.rm = TRUE), lat = mean(lat), lon = mean(lon))
# draw the plot by cities
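# include.lowest = TRUE keeps the city with the minimum delay in the first quartile instead of NA
merged_city_delay$q <- with(merged_city_delay, cut(mean_delay, quantile(mean_delay), include.lowest = TRUE))
# four intervals between the five quantile breaks, so four labels
levels(merged_city_delay$q) <- paste(c("1st", "2nd", "3rd", "4th"), "Quartile")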
merged_city_delay$q <- as.ordered((merged_city_delay$q))


g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showland = TRUE,
  landcolor = toRGB("gray85"),
  subunitwidth = 1,
  countrywidth = 1,
  subunitcolor = toRGB("white"),
  countrycolor = toRGB("white")
)

p_cities <- plot_geo(merged_city_delay, locationmode = 'USA-states', sizes = c(1, 250)) %>%
  add_markers(
    x = ~lon, y = ~lat, size = ~mean_delay, color = ~q, hoverinfo = "text",
    text = ~paste(merged_city_delay$name, "<br />", merged_city_delay$mean_delay, "minutes")
  ) %>%
  layout(title = '2017 average departure delay (minutes) by city', geo = g)
p_cities
## Warning: Ignoring 102 observations
## Warning: `line.width` does not currently support multiple values.


From the plot, we see that, as in the plot by state, cities in the Northeast, the South and the West Coast are more likely to have delay times in the highest (yellow) or second highest (green) quartile, with some cities (e.g. St. Augustine in Florida) reaching average delays of more than 60 minutes. Cities with the shortest average delay times are generally in the Midwest.
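The quartile grouping used for the bubble colors can be sketched on a small made-up vector. `cut()` with the default `quantile()` breaks yields four intervals, and without `include.lowest = TRUE` the smallest value falls outside the first interval and becomes `NA`. The values and label text below are illustrative.

```r
# Made-up city-level mean delays in minutes.
mean_delay <- c(42, 55, 48, 61, 39, 70, 52, 45)

# Five break points (min, 25%, 50%, 75%, max) define four quartile bins.
breaks <- quantile(mean_delay)
q <- cut(mean_delay, breaks = breaks, include.lowest = TRUE,
         labels = paste(c("1st", "2nd", "3rd", "4th"), "quartile"),
         ordered_result = TRUE)
print(table(q))

# Without include.lowest, the minimum value is not assigned to any bin.
n_dropped <- sum(is.na(cut(mean_delay, breaks = breaks)))
```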

Next, we look at the flight routes with delays: we display the routes with an average delay time of 60+, 120+, 180+, and 240+ minutes in 2017:

# group by flight routes and calculate mean departure delay

route_delay <- dat %>% filter(DEP_DEL15==1) %>%
  group_by(ORIGIN_CITY_NAME, DEST_CITY_NAME) %>%
  summarize(mean_delay = mean(DEP_DELAY_NEW, na.rm = TRUE)) 


route_delay <- cSplit(route_delay, "ORIGIN_CITY_NAME", sep=",")
route_delay <- cSplit(route_delay, "DEST_CITY_NAME", sep=",")

route_delay <- route_delay %>% mutate(name1 = ORIGIN_CITY_NAME_1, name2 = DEST_CITY_NAME_1)
# add the coordinates of cities
coordinate <- read.csv('https://raw.githubusercontent.com/plotly/datasets/master/2014_us_cities.csv')

route_delay <- route_delay %>% mutate(name1 = trimws(as.character(name1)), name2 = trimws(as.character(name2)))

coordinate <- coordinate %>% mutate(name = trimws(as.character(name)))


merged_1 <- left_join(route_delay,coordinate, by = c("name1" = "name")) %>%
  rename(lat1 = lat, lon1 = lon, pop1 = pop) %>%
  select(mean_delay, name1, name2, pop1, lat1, lon1)

merged_2 <- left_join(route_delay,coordinate, by = c("name2" = "name")) %>%
  rename(lat2 = lat, lon2 = lon, pop2 = pop) %>%
  select(mean_delay, name1, name2, pop2, lat2, lon2)

merged_route_delay <- left_join(merged_1, merged_2, by = c("name1", "name2")) %>%
  rename(mean_delay = mean_delay.x) %>%
  select(mean_delay, name1, name2, pop1, lat1, lon1, pop2, lat2, lon2)

merged_route_delay <- merged_route_delay %>%      # get the mean population for each city
  group_by(name1, name2) %>%
  summarize(mean_delay = mean(mean_delay, na.rm = TRUE), 
            pop1 = mean(pop1, na.rm = TRUE), pop2 = mean(pop2, na.rm = TRUE),
            lat1 = mean(lat1, na.rm = TRUE), lon1 = mean(lon1, na.rm = TRUE),
            lat2 = mean(lat2, na.rm = TRUE), lon2 = mean(lon2, na.rm = TRUE))
# map projection

# restrict to >=60, >=120, >=180, >=240 minutes of delay
delay1 <-merged_route_delay %>%
  filter(mean_delay >= 60) 
delay2 <-merged_route_delay %>%
  filter(mean_delay >= 120) 
delay3 <-merged_route_delay %>%
  filter(mean_delay >= 180) 
delay4 <-merged_route_delay %>%
  filter(mean_delay >= 240) %>% filter(!is.na(pop1)) %>% filter(!is.na(pop2)) 

geo <- list(
  scope = 'north america',
  projection = list(type = 'azimuthal equal area'),
  showland = TRUE,
  landcolor = toRGB("gray95"),
  countrycolor = toRGB("gray80")
)


p1 <- plot_geo(locationmode = 'USA-states', color = I("red")) %>%
  add_markers(
    data = delay1, x = ~lon1, y = ~lat1, text = ~name1,
    size = ~pop1, hoverinfo = "text", alpha = 0.5
  ) %>%
  add_markers(
    data = delay1, x = ~lon2, y = ~lat2, text = ~name2,
    size = ~pop2, hoverinfo = "text", alpha = 0.5
  ) %>%
  add_segments(
    x = ~lon1, xend = ~lon2,
    y = ~lat1, yend = ~lat2,
    alpha = 0.3, size = I(1), hoverinfo = "none"
  ) %>%
  layout(
    title = '2017 flight routes with >60 min delay',
    geo = geo, showlegend = FALSE) 



p2 <- plot_geo(locationmode = 'USA-states', color = I("red")) %>%
  add_markers(
    data = delay2, x = ~lon1, y = ~lat1, text = ~name1,
    size = ~pop1, hoverinfo = "text", alpha = 0.5
  ) %>%
  add_markers(
    data = delay2, x = ~lon2, y = ~lat2, text = ~name2,
    size = ~pop2, hoverinfo = "text", alpha = 0.5
  ) %>%
  add_segments(
    x = ~lon1, xend = ~lon2,
    y = ~lat1, yend = ~lat2,
    alpha = 0.3, size = I(1), hoverinfo = "none"
  ) %>%
  layout(
    title = '2017 flight routes with >120 min delay',
    geo = geo, showlegend = FALSE)



p3 <- plot_geo(locationmode = 'USA-states', color = I("red")) %>%
  add_markers(
    data = delay3, x = ~lon1, y = ~lat1, text = ~name1,
    size = ~pop1, hoverinfo = "text", alpha = 0.5
  ) %>%
  add_markers(
    data = delay3, x = ~lon2, y = ~lat2, text = ~name2,
    size = ~pop2, hoverinfo = "text", alpha = 0.5
  ) %>%
  add_segments(
    x = ~lon1, xend = ~lon2,
    y = ~lat1, yend = ~lat2,
    alpha = 0.3, size = I(1), hoverinfo = "none"
  ) %>%
  layout(
    title = '2017 flight routes with >180 min delay',
    geo = geo, showlegend = FALSE )


p4 <- plot_geo(locationmode = 'USA-states', color = I("red")) %>%
  add_markers(
    data = delay4, x = ~lon1, y = ~lat1, text = ~name1,
    size = ~pop1, hoverinfo = "text", alpha = 0.5
  ) %>%
  add_markers(
    data = delay4, x = ~lon2, y = ~lat2, text = ~name2,
    size = ~pop2, hoverinfo = "text", alpha = 0.5
  ) %>%
  add_segments(
    x = ~lon1, xend = ~lon2,
    y = ~lat1, yend = ~lat2,
    alpha = 0.3, size = I(1), hoverinfo = "none"
  ) %>%
  layout(
    title = '2017 flight routes with >240 min delay',
    geo = geo, showlegend = FALSE )
p <- subplot(p1, p2, p3, p4, nrows = 2) %>%
  layout(title = "2017 flight routes with different delay times",
         xaxis = list(domain=list(x=c(0,0.5),y=c(0,0.5))),
         scene = list(domain=list(x=c(0.5,1),y=c(0,0.5))),
         xaxis2 = list(domain=list(x=c(0.5,1),y=c(0.5,1))),
         annotations = list(
           list(x = 0.2, y = 1, text = ">60 mins", showarrow = FALSE, xref = 'paper', yref = 'paper'),
           list(x = 0.8, y = 1, text = ">120 mins", showarrow = FALSE, xref = 'paper', yref = 'paper'),
           list(x = 0.2, y = 0.5, text = ">180 mins", showarrow = FALSE, xref = 'paper', yref = 'paper'),
           list(x = 0.8, y = 0.5, text = ">240 mins", showarrow = FALSE, xref = 'paper', yref = 'paper'))
         )
## Warning: Ignoring 475 observations
## Warning: Ignoring 440 observations
## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.
## Warning: Ignoring 475 observations
## Warning: Ignoring 40 observations
## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.
## Warning: Ignoring 11 observations
## Warning: Ignoring 10 observations
## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.
## Warning: Ignoring 11 observations
## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.
p

The plot shows that as the delay threshold increases, the number of routes exceeding it decreases. Many routes had an average delay of 15+ minutes in 2017, but very few had an average delay of more than 60 or 90 minutes (e.g. the route between New York and San Antonio).
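The shrinking number of routes can also be checked numerically. The sketch below assumes the four panels were built by thresholding a per-route average delay; the data frame `flights` and its columns `route` and `delay` are hypothetical stand-ins, not objects from this analysis.

```r
# Hypothetical flight-level data: one row per flight (illustrative values only)
flights <- data.frame(
  route = c("A-B", "A-B", "A-C", "A-C", "B-C", "B-C", "C-D", "C-D"),
  delay = c(70, 90, 130, 150, 200, 220, 250, 270)
)

# Average delay per route
route_means <- tapply(flights$delay, flights$route, mean)

# Number of routes whose average delay exceeds each plotted threshold
thresholds <- c(60, 120, 180, 240)
n_routes <- sapply(thresholds, function(th) sum(route_means > th))
names(n_routes) <- paste0(">", thresholds, " min")
n_routes
```

With these toy values the counts fall from 4 to 1 as the threshold rises, which is the same pattern the maps show visually.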

Part 3: Regression models

After gaining an overview of the delay patterns by various factors, we wish to predict delay times. We use linear regression models to predict mean delay times, and logistic regression models to predict the probability of delay (>= 15 minutes).
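As a minimal sketch of the logistic half of this plan, the example below simulates departure delays for two made-up carriers (`X1`, `X2`), defines a delay as 15 minutes or more, and fits `glm()` with a binomial family. The data and variable names are assumptions for illustration only. With a single categorical predictor, the fitted probabilities should reproduce the observed delay proportions.

```r
set.seed(1)
n <- 2000

# Simulated data: carrier X2 is given longer delays on average (illustrative only)
toy <- data.frame(carrier = factor(sample(c("X1", "X2"), n, replace = TRUE)))
toy$dep_delay <- rexp(n, rate = ifelse(toy$carrier == "X1", 1 / 10, 1 / 20))

# Binary outcome: delayed if the departure delay is at least 15 minutes
toy$delayed <- as.integer(toy$dep_delay >= 15)

# Logistic regression of delay status on carrier
fit <- glm(delayed ~ carrier, data = toy, family = binomial)

# Predicted probability of delay for each carrier
new_carriers <- data.frame(
  carrier = factor(levels(toy$carrier), levels = levels(toy$carrier))
)
p_hat <- predict(fit, newdata = new_carriers, type = "response")

# Observed proportion of delayed flights per carrier, for comparison
observed <- tapply(toy$delayed, toy$carrier, mean)
cbind(predicted = as.vector(p_hat), observed = as.vector(observed))
```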

Predict mean delay time for each carrier / day of week / time of day using linear model

In this part, we use linear models to predict mean delay times. Our predictors of interest are carrier, day of week and time of day. We examine each one separately, both in a crude model and in an adjusted model that incorporates these factors: (1) carrier, (2) month, (3) day of week, (4) distance of flight route, (5) time of day, and (6) region of departure. The models are fit only to flights with a departure delay of at least 15 minutes, so the predicted means are mean delay times among delayed flights.
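To make the crude versus adjusted distinction concrete, here is a small base-R sketch on simulated data (the carriers `X1` to `X3` and the two-level `hour_cat` are made up). It reproduces by hand what a least-squares mean does under its default settings: predict on every combination of factor levels, then average over the other factors with equal weights. This is an illustration of the idea, not the exact `lsmeans` computation used in this report.

```r
set.seed(42)
n <- 600

# Simulated data: "early" departures are rare (10%) but have much longer delays
toy <- data.frame(
  carrier  = factor(sample(c("X1", "X2", "X3"), n, replace = TRUE)),
  hour_cat = factor(sample(c("early", "day"), n, replace = TRUE, prob = c(0.1, 0.9)))
)
toy$delay <- 60 + 10 * (toy$carrier == "X2") +
  70 * (toy$hour_cat == "early") + rnorm(n, sd = 20)

crude    <- lm(delay ~ carrier, data = toy)
adjusted <- lm(delay ~ carrier + hour_cat, data = toy)

# Crude means: with one factor, the fitted values are the raw group means
carriers <- data.frame(
  carrier = factor(levels(toy$carrier), levels = levels(toy$carrier))
)
crude_means <- predict(crude, newdata = carriers)

# Adjusted means: predict on the full carrier x hour_cat grid,
# then average over hour_cat giving each level equal weight
grid <- expand.grid(carrier = levels(toy$carrier), hour_cat = levels(toy$hour_cat))
grid$pred <- predict(adjusted, newdata = grid)
adjusted_means <- tapply(grid$pred, grid$carrier, mean)

rbind(crude = as.vector(crude_means), adjusted = as.vector(adjusted_means))
```

Because the rare but long-delayed `early` level gets half of the weight in the adjusted means instead of its 10% share of the data, the adjusted means come out higher than the crude ones. The same mechanism may explain why the adjusted means reported in this part are higher than the crude means.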

# crude, predictor: carrier
df <- dat %>%
   filter(DEP_DELAY_NEW>=15)
delay.lm = lm(DEP_DELAY_NEW ~ OP_UNIQUE_CARRIER, data = df)
lsmeans(delay.lm, ~ OP_UNIQUE_CARRIER)
##  OP_UNIQUE_CARRIER   lsmean        SE      df lower.CL upper.CL
##  AA                64.44212 0.2136694 1013833 64.02333 64.86090
##  AS                53.62569 0.5244802 1013833 52.59773 54.65365
##  B6                72.76209 0.2904053 1013833 72.19291 73.33128
##  DL                69.02952 0.2200936 1013833 68.59814 69.46089
##  EV                85.72511 0.3212330 1013833 85.09550 86.35472
##  F9                70.47161 0.5549647 1013833 69.38390 71.55933
##  HA                48.57617 0.9886941 1013833 46.63837 50.51398
##  NK                72.09632 0.4730808 1013833 71.16909 73.02354
##  OO                85.21249 0.2381762 1013833 84.74567 85.67930
##  UA                71.50244 0.2587640 1013833 70.99528 72.00961
##  VX                60.78735 0.6073894 1013833 59.59689 61.97781
##  WN                47.68216 0.1516620 1013833 47.38491 47.97942
## 
## Confidence level used: 0.95
# adjusted, predictor: carrier
df <- df %>%
   mutate(hour_cat = cut(DEP_TIME, breaks=c(-Inf, 600, 1200, 1800, Inf), labels=c("0 to 6","6 to 12","12 to 18", "18 to 24"))) %>%
   mutate(NORTHEAST = ifelse(ORIGIN_STATE_ABR %in% c("CT","ME", "MA", "NH","RI","VT","NJ","NY","PA"), "yes", "no")) %>%
   mutate(MIDWEST = ifelse(ORIGIN_STATE_ABR %in% c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"), "yes", "no")) %>%
   mutate(SOUTH = ifelse(ORIGIN_STATE_ABR %in% c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN","AR","LA","OK","TX"), "yes", "no")) %>%
   mutate(WEST = ifelse(ORIGIN_STATE_ABR %in% c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"), "yes", "no")) %>%
   mutate(SPRING = ifelse(MONTH %in% c(3,4,5),"yes","no")) %>%
   mutate(SUMMER = ifelse(MONTH %in% c(6,7,8),"yes","no")) %>%
   mutate(FALL = ifelse(MONTH %in% c(9,10,11),"yes","no")) %>%
   mutate(WINTER = ifelse(MONTH %in% c(12,1,2),"yes","no"))

delay2.lm = lm(DEP_DELAY_NEW ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, data = df)

summary(delay2.lm)
## 
## Call:
## lm(formula = DEP_DELAY_NEW ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + 
##     DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -148.59  -37.48  -19.50   11.24 2700.80 
## 
## Coefficients:
##                        Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)           1.331e+02  1.162e+00  114.536  < 2e-16 ***
## OP_UNIQUE_CARRIERAS  -7.145e+00  5.736e-01  -12.456  < 2e-16 ***
## OP_UNIQUE_CARRIERB6   2.683e+00  3.657e-01    7.336 2.20e-13 ***
## OP_UNIQUE_CARRIERDL   4.147e+00  3.039e-01   13.646  < 2e-16 ***
## OP_UNIQUE_CARRIEREV   2.055e+01  3.915e-01   52.492  < 2e-16 ***
## OP_UNIQUE_CARRIERF9   4.009e+00  5.898e-01    6.797 1.07e-11 ***
## OP_UNIQUE_CARRIERHA  -8.929e+00  1.011e+00   -8.835  < 2e-16 ***
## OP_UNIQUE_CARRIERNK   4.465e+00  5.134e-01    8.697  < 2e-16 ***
## OP_UNIQUE_CARRIEROO   2.426e+01  3.412e-01   71.096  < 2e-16 ***
## OP_UNIQUE_CARRIERUA   7.436e+00  3.349e-01   22.202  < 2e-16 ***
## OP_UNIQUE_CARRIERVX  -4.653e-02  6.451e-01   -0.072 0.942504    
## OP_UNIQUE_CARRIERWN  -1.508e+01  2.678e-01  -56.313  < 2e-16 ***
## MONTH                -3.640e-01  2.414e-02  -15.076  < 2e-16 ***
## factor(DAY_OF_WEEK)2 -4.149e+00  2.966e-01  -13.987  < 2e-16 ***
## factor(DAY_OF_WEEK)3 -2.696e+00  2.929e-01   -9.206  < 2e-16 ***
## factor(DAY_OF_WEEK)4 -3.668e+00  2.811e-01  -13.049  < 2e-16 ***
## factor(DAY_OF_WEEK)5 -1.016e+00  2.768e-01   -3.670 0.000243 ***
## factor(DAY_OF_WEEK)6 -1.031e+00  3.124e-01   -3.301 0.000965 ***
## factor(DAY_OF_WEEK)7 -1.402e+00  2.918e-01   -4.803 1.56e-06 ***
## DISTANCE             -8.189e-04  1.386e-04   -5.907 3.48e-09 ***
## hour_cat6 to 12      -7.637e+01  5.505e-01 -138.723  < 2e-16 ***
## hour_cat12 to 18     -7.571e+01  5.369e-01 -141.010  < 2e-16 ***
## hour_cat18 to 24     -6.301e+01  5.377e-01 -117.200  < 2e-16 ***
## NORTHEASTyes          1.071e+01  1.024e+00   10.455  < 2e-16 ***
## MIDWESTyes            5.719e+00  1.030e+00    5.551 2.83e-08 ***
## SOUTHyes              6.234e+00  1.015e+00    6.141 8.19e-10 ***
## WESTyes               1.373e+00  1.018e+00    1.349 0.177480    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.81 on 1013818 degrees of freedom
## Multiple R-squared:  0.05025,    Adjusted R-squared:  0.05023 
## F-statistic:  2063 on 26 and 1013818 DF,  p-value: < 2.2e-16
lsmeans(delay2.lm, ~ OP_UNIQUE_CARRIER)
##  OP_UNIQUE_CARRIER    lsmean       SE      df  lower.CL  upper.CL
##  AA                 86.30274 1.054612 1013818  84.23574  88.36974
##  AS                 79.15797 1.153785 1013818  76.89659  81.41936
##  B6                 88.98538 1.103015 1013818  86.82350  91.14725
##  DL                 90.44956 1.050768 1013818  88.39009  92.50903
##  EV                106.85069 1.077871 1013818 104.73810 108.96328
##  F9                 90.31143 1.163980 1013818  88.03007  92.59280
##  HA                 77.37354 1.424548 1013818  74.58147  80.16561
##  NK                 90.76775 1.133621 1013818  88.54589  92.98961
##  OO                110.55841 1.056248 1013818 108.48820 112.62862
##  UA                 93.73895 1.062516 1013818  91.65646  95.82145
##  VX                 86.25621 1.192479 1013818  83.91899  88.59343
##  WN                 71.22161 1.041578 1013818  69.18016  73.26307
## 
## Results are averaged over the levels of: DAY_OF_WEEK, hour_cat, NORTHEAST, MIDWEST, SOUTH, WEST 
## Confidence level used: 0.95

We summarized the above results into the table below: Table 1

From the adjusted predicted mean delay times for each carrier, we see that, with all other factors averaged, Southwest Airlines (WN) has the shortest predicted delay time, followed by Hawaiian Airlines (HA), Alaska Airlines (AS), Virgin America (VX) and American Airlines (AA). SkyWest Airlines (OO) has the longest predicted delay time, followed by ExpressJet (EV).
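The ranking can be read directly off the adjusted output. The sketch below copies the `lsmean` column from the `lsmeans(delay2.lm, ~ OP_UNIQUE_CARRIER)` output above into a data frame (the object names `adj` and `ranked` are ours) and sorts it.

```r
# Adjusted least-squares means copied from the output above
adj <- data.frame(
  carrier = c("AA", "AS", "B6", "DL", "EV", "F9", "HA", "NK", "OO", "UA", "VX", "WN"),
  lsmean  = c(86.30274, 79.15797, 88.98538, 90.44956, 106.85069, 90.31143,
              77.37354, 90.76775, 110.55841, 93.73895, 86.25621, 71.22161)
)

# Sort from shortest to longest predicted delay
ranked <- adj[order(adj$lsmean), ]
ranked

head(ranked$carrier, 3)  # "WN" "HA" "AS"
tail(ranked$carrier, 2)  # "EV" "OO"
```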

# crude, predictor: day of week
delay_day.lm = lm(DEP_DELAY_NEW ~ factor(DAY_OF_WEEK), data = df)
lsmeans(delay_day.lm, ~ DAY_OF_WEEK)
##  DAY_OF_WEEK   lsmean        SE      df lower.CL upper.CL
##            1 68.09341 0.2035731 1013838 67.69442 68.49241
##            2 63.18018 0.2259230 1013838 62.73738 63.62298
##            3 64.22169 0.2206569 1013838 63.78921 64.65417
##            4 63.26493 0.2038631 1013838 62.86537 63.66450
##            5 66.05751 0.1975888 1013838 65.67024 66.44478
##            6 64.94108 0.2464242 1013838 64.45809 65.42406
##            7 66.71645 0.2191147 1013838 66.28700 67.14591
## 
## Confidence level used: 0.95
# adjusted, predictor: day of week
lsmeans(delay2.lm, ~ DAY_OF_WEEK )
##  DAY_OF_WEEK   lsmean       SE      df lower.CL upper.CL
##            1 91.32565 1.054609 1013818 89.25865 93.39265
##            2 87.17708 1.059718 1013818 85.10007 89.25409
##            3 88.62944 1.058722 1013818 86.55438 90.70450
##            4 87.65786 1.054795 1013818 85.59050 89.72522
##            5 90.30979 1.053053 1013818 88.24584 92.37374
##            6 90.29442 1.067013 1013818 88.20311 92.38573
##            7 89.92408 1.058917 1013818 87.84864 91.99952
## 
## Results are averaged over the levels of: OP_UNIQUE_CARRIER, hour_cat, NORTHEAST, MIDWEST, SOUTH, WEST 
## Confidence level used: 0.95

We summarized the above results into the table below: Table 2

From the adjusted predicted mean delay times by day of week (BTS codes DAY_OF_WEEK as 1 = Monday through 7 = Sunday), we see that, with all other factors averaged, flights on Tuesday, Thursday or Wednesday generally have shorter delay times, while flights on Monday have the longest, with Friday and Saturday close behind. The differences are small: about 4 minutes separate the best and worst days.

# crude, predictor: time of day
delay_day.lm = lm(DEP_DELAY_NEW ~ factor(hour_cat), data = df)
lsmeans(delay_day.lm, ~ hour_cat)
##  hour_cat    lsmean        SE      df  lower.CL  upper.CL
##  0 to 6   134.31299 0.5217897 1013841 133.29030 135.33568
##  6 to 12   59.61137 0.1806714 1013841  59.25726  59.96548
##  12 to 18  58.66513 0.1283653 1013841  58.41354  58.91672
##  18 to 24  70.69035 0.1295534 1013841  70.43643  70.94427
## 
## Confidence level used: 0.95
# adjusted, predictor: time of day
lsmeans(delay2.lm, ~ hour_cat )
##  hour_cat    lsmean       SE      df  lower.CL  upper.CL
##  0 to 6   143.10500 1.174924 1013818 140.80219 145.40781
##  6 to 12   66.73209 1.035838 1013818  64.70188  68.76229
##  12 to 18  67.39661 1.032882 1013818  65.37220  69.42103
##  18 to 24  80.09105 1.027952 1013818  78.07630  82.10580
## 
## Results are averaged over the levels of: OP_UNIQUE_CARRIER, DAY_OF_WEEK, NORTHEAST, MIDWEST, SOUTH, WEST 
## Confidence level used: 0.95

We summarized the above results into the table below: Table 3

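From the adjusted model, we see that, with all other factors averaged, flights departing in the morning (6:00 to 12:00) or afternoon (12:00 to 18:00) generally have the shortest delay times, while flights leaving at night (18:00 to 24:00) have longer delays. By far the longest predicted delays are for departures between 0:00 and 6:00. Because hour_cat is built from the actual departure time (DEP_TIME), this group likely includes flights that were scheduled for the previous evening and left very late.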
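One detail of the time-of-day categories is worth checking: `cut()` builds right-closed intervals by default, so a departure at exactly 600 (6:00) is labeled "0 to 6" rather than "6 to 12". The sketch below uses the same breaks and labels as the analysis on a few illustrative hhmm values and compares the default with `right = FALSE`.

```r
# Illustrative departure times in hhmm format
dep_time <- c(1, 559, 600, 601, 1200, 1800, 2359)
breaks   <- c(-Inf, 600, 1200, 1800, Inf)
labels   <- c("0 to 6", "6 to 12", "12 to 18", "18 to 24")

# Default: intervals are (a, b], so boundary values fall in the earlier bin
right_closed <- cut(dep_time, breaks = breaks, labels = labels)

# Alternative: intervals are [a, b), so boundary values fall in the later bin
left_closed <- cut(dep_time, breaks = breaks, labels = labels, right = FALSE)

data.frame(dep_time, right_closed, left_closed)
```

The two versions differ only for departures at exactly 600, 1200 and 1800, so the choice is unlikely to change the results noticeably, but `right = FALSE` matches the labels more literally.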

# stratification
# predictor: carrier
# stratified by day of week
lsmeans(delay2.lm, ~ OP_UNIQUE_CARRIER*DAY_OF_WEEK )
##  OP_UNIQUE_CARRIER DAY_OF_WEEK    lsmean       SE      df  lower.CL  upper.CL
##  AA                          1  88.29720 1.069390 1013818  86.20123  90.39317
##  AS                          1  81.15243 1.167325 1013818  78.86452  83.44035
##  B6                          1  90.97984 1.117285 1013818  88.78999  93.16968
##  DL                          1  92.44402 1.064821 1013818  90.35701  94.53103
##  EV                          1 108.84515 1.091626 1013818 106.70560 110.98470
##  F9                          1  92.30589 1.177318 1013818  89.99839  94.61340
##  HA                          1  79.36800 1.436411 1013818  76.55268  82.18332
##  NK                          1  92.76221 1.147410 1013818  90.51332  95.01109
##  OO                          1 112.55287 1.070404 1013818 110.45492 114.65083
##  UA                          1  95.73341 1.076705 1013818  93.62311  97.84372
##  VX                          1  88.25067 1.205014 1013818  85.88888  90.61246
##  WN                          1  73.21607 1.056875 1013818  71.14463  75.28751
##  AA                          2  84.14863 1.074666 1013818  82.04232  86.25494
##  AS                          2  77.00386 1.171917 1013818  74.70695  79.30078
##  B6                          2  86.83127 1.121694 1013818  84.63278  89.02975
##  DL                          2  88.29545 1.070569 1013818  86.19717  90.39373
##  EV                          2 104.69658 1.096897 1013818 102.54670 106.84646
##  F9                          2  88.15732 1.181370 1013818  85.84188  90.47277
##  HA                          2  75.21943 1.439157 1013818  72.39873  78.04013
##  NK                          2  88.61364 1.152254 1013818  86.35526  90.87202
##  OO                          2 108.40430 1.075574 1013818 106.29621 110.51239
##  UA                          2  91.58484 1.082112 1013818  89.46394  93.70575
##  VX                          2  84.10210 1.210309 1013818  81.72994  86.47427
##  WN                          2  69.06750 1.061475 1013818  66.98705  71.14796
##  AA                          3  85.60099 1.073324 1013818  83.49732  87.70467
##  AS                          3  78.45623 1.170824 1013818  76.16145  80.75100
##  B6                          3  88.28363 1.121069 1013818  86.08637  90.48089
##  DL                          3  89.74781 1.070117 1013818  87.65042  91.84521
##  EV                          3 106.14895 1.096250 1013818 104.00033 108.29756
##  F9                          3  89.60969 1.180781 1013818  87.29540  91.92398
##  HA                          3  76.67179 1.438230 1013818  73.85291  79.49068
##  NK                          3  90.06600 1.151269 1013818  87.80955  92.32245
##  OO                          3 109.85667 1.074980 1013818 107.74974 111.96359
##  UA                          3  93.03721 1.080647 1013818  90.91917  95.15524
##  VX                          3  85.55446 1.209239 1013818  83.18440  87.92453
##  WN                          3  70.51987 1.060196 1013818  68.44192  72.59782
##  AA                          4  84.62942 1.069538 1013818  82.53316  86.72567
##  AS                          4  77.48465 1.166991 1013818  75.19739  79.77191
##  B6                          4  87.31205 1.117386 1013818  85.12201  89.50209
##  DL                          4  88.77624 1.065349 1013818  86.68819  90.86428
##  EV                          4 105.17737 1.092415 1013818 103.03627 107.31846
##  F9                          4  88.63811 1.178259 1013818  86.32876  90.94746
##  HA                          4  75.70022 1.436200 1013818  72.88531  78.51512
##  NK                          4  89.09442 1.147609 1013818  86.84515  91.34370
##  OO                          4 108.88509 1.071016 1013818 106.78593 110.98424
##  UA                          4  92.06563 1.076844 1013818  89.95505  94.17620
##  VX                          4  84.58289 1.205170 1013818  82.22079  86.94498
##  WN                          4  69.54829 1.055951 1013818  67.47866  71.61792
##  AA                          5  87.28134 1.067753 1013818  85.18858  89.37410
##  AS                          5  80.13658 1.165795 1013818  77.85166  82.42150
##  B6                          5  89.96398 1.115595 1013818  87.77745  92.15051
##  DL                          5  91.42816 1.063559 1013818  89.34362  93.51270
##  EV                          5 107.82930 1.090716 1013818 105.69153 109.96706
##  F9                          5  91.29004 1.177092 1013818  88.98298  93.59710
##  HA                          5  78.35214 1.433452 1013818  75.54263  81.16166
##  NK                          5  91.74635 1.146215 1013818  89.49981  93.99289
##  OO                          5 111.53701 1.069719 1013818 109.44040 113.63363
##  UA                          5  94.71755 1.075473 1013818  92.60966  96.82545
##  VX                          5  87.23481 1.203664 1013818  84.87567  89.59396
##  WN                          5  72.20022 1.054648 1013818  70.13314  74.26729
##  AA                          6  87.26597 1.081337 1013818  85.14659  89.38535
##  AS                          6  80.12120 1.178615 1013818  77.81116  82.43125
##  B6                          6  89.94860 1.128415 1013818  87.73695  92.16026
##  DL                          6  91.41279 1.078444 1013818  89.29908  93.52650
##  EV                          6 107.81392 1.105340 1013818 105.64749 109.98035
##  F9                          6  91.27466 1.187159 1013818  88.94787  93.60146
##  HA                          6  78.33677 1.443095 1013818  75.50835  81.16519
##  NK                          6  91.73098 1.157420 1013818  89.46247  93.99948
##  OO                          6 111.52164 1.083596 1013818 109.39783 113.64545
##  UA                          6  94.70218 1.090356 1013818  92.56512  96.83924
##  VX                          6  87.21944 1.217311 1013818  84.83355  89.60533
##  WN                          6  72.18484 1.069079 1013818  70.08948  74.28020
##  AA                          7  86.89563 1.073037 1013818  84.79252  88.99875
##  AS                          7  79.75087 1.171064 1013818  77.45562  82.04611
##  B6                          7  89.57827 1.120954 1013818  87.38124  91.77530
##  DL                          7  91.04245 1.069725 1013818  88.94583  93.13908
##  EV                          7 107.44359 1.095915 1013818 105.29563 109.59154
##  F9                          7  90.90433 1.180931 1013818  88.58974  93.21891
##  HA                          7  77.96643 1.439523 1013818  75.14502  80.78785
##  NK                          7  91.36064 1.151247 1013818  89.10423  93.61704
##  OO                          7 111.15130 1.075013 1013818 109.04432 113.25829
##  UA                          7  94.33185 1.081300 1013818  92.21253  96.45116
##  VX                          7  86.84910 1.208960 1013818  84.47958  89.21862
##  WN                          7  71.81451 1.061135 1013818  69.73472  73.89430
## 
## Results are averaged over the levels of: hour_cat, NORTHEAST, MIDWEST, SOUTH, WEST 
## Confidence level used: 0.95

We summarized the above results (stratified by day of week) into the table below: Table 4

Because delay2.lm has no carrier-by-day interaction term, each day shifts all carriers by the same amount and the carrier ranking is the same on every day. As in the adjusted day-of-week means above, predicted delay times are shortest on Tuesdays (day 2) and longest on Mondays (day 1), with Fridays about one minute below Mondays. On every day, the carrier with the shortest predicted delay is Southwest (WN) and the carrier with the longest is SkyWest (OO). So it is probably not a good idea to leave on a Monday on a SkyWest flight!
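Whether carrier rankings can differ by day depends on whether the model contains an interaction term. The sketch below, on simulated data with made-up carriers `X1` to `X3`, compares an additive model (the same structure as `carrier + day`) with an interaction model (`carrier * day`). In the additive fit the gap between two carriers is the same on every day by construction; only the interaction fit lets it vary.

```r
set.seed(7)

# Simulated balanced data: 3 carriers x 7 days x 30 flights each (illustrative only)
toy <- expand.grid(carrier = c("X1", "X2", "X3"), day = factor(1:7), rep = 1:30)
toy$delay <- 60 + 5 * as.integer(toy$carrier) + 2 * as.integer(toy$day) +
  rnorm(nrow(toy), sd = 15)

additive <- lm(delay ~ carrier + day, data = toy)  # no interaction
with_int <- lm(delay ~ carrier * day, data = toy)  # carrier-by-day interaction

# Predictions for every carrier x day combination
grid <- expand.grid(carrier = levels(toy$carrier), day = levels(toy$day))
grid$additive <- predict(additive, newdata = grid)
grid$with_int <- predict(with_int, newdata = grid)

# Predicted gap between X2 and X1 on each day, for a given prediction column
gap <- function(col) {
  wide <- tapply(grid[[col]], list(grid$day, grid$carrier), mean)
  wide[, "X2"] - wide[, "X1"]
}

gap("additive")  # the same value on all seven days
gap("with_int")  # free to differ from day to day
```

To ask whether a carrier's relative performance really changes by weekday, a model with an interaction term, or separate fits per day, would be needed.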

# stratification
# predictor: carrier
# stratified by time of day
lsmeans(delay2.lm, ~ OP_UNIQUE_CARRIER*hour_cat )
##  OP_UNIQUE_CARRIER hour_cat    lsmean       SE      df  lower.CL  upper.CL
##  AA                0 to 6   140.07656 1.188703 1013818 137.74674 142.40638
##  AS                0 to 6   132.93179 1.278937 1013818 130.42512 135.43846
##  B6                0 to 6   142.75919 1.224120 1013818 140.35996 145.15842
##  DL                0 to 6   144.22338 1.184001 1013818 141.90278 146.54398
##  EV                0 to 6   160.62451 1.209686 1013818 158.25356 162.99545
##  F9                0 to 6   144.08525 1.279970 1013818 141.57655 146.59395
##  HA                0 to 6   131.14736 1.530012 1013818 128.14858 134.14613
##  NK                0 to 6   144.54156 1.253775 1013818 142.08420 146.99892
##  OO                0 to 6   164.33223 1.192376 1013818 161.99521 166.66924
##  UA                0 to 6   147.51277 1.195509 1013818 145.16961 149.85593
##  VX                0 to 6   140.03003 1.316445 1013818 137.44984 142.61021
##  WN                0 to 6   124.99543 1.179585 1013818 122.68348 127.30738
##  AA                6 to 12   63.70364 1.050290 1013818  61.64511  65.76217
##  AS                6 to 12   56.55887 1.148643 1013818  54.30757  58.81018
##  B6                6 to 12   66.38627 1.102992 1013818  64.22445  68.54810
##  DL                6 to 12   67.85046 1.047083 1013818  65.79821  69.90271
##  EV                6 to 12   84.25159 1.073163 1013818  82.14823  86.35495
##  F9                6 to 12   67.71233 1.162382 1013818  65.43410  69.99056
##  HA                6 to 12   54.77444 1.420556 1013818  51.99020  57.55868
##  NK                6 to 12   68.16865 1.131908 1013818  65.95014  70.38715
##  OO                6 to 12   87.95931 1.050498 1013818  85.90037  90.01825
##  UA                6 to 12   71.13985 1.058600 1013818  69.06503  73.21467
##  VX                6 to 12   63.65711 1.187310 1013818  61.33002  65.98420
##  WN                6 to 12   48.62251 1.038192 1013818  46.58769  50.65734
##  AA                12 to 18  64.36817 1.047627 1013818  62.31485  66.42148
##  AS                12 to 18  57.22340 1.147739 1013818  54.97387  59.47293
##  B6                12 to 18  67.05080 1.099430 1013818  64.89596  69.20565
##  DL                12 to 18  68.51499 1.044078 1013818  66.46863  70.56135
##  EV                12 to 18  84.91612 1.070504 1013818  82.81797  87.01427
##  F9                12 to 18  68.37686 1.160889 1013818  66.10156  70.65216
##  HA                12 to 18  55.43897 1.416716 1013818  52.66225  58.21568
##  NK                12 to 18  68.83317 1.130041 1013818  66.61833  71.04802
##  OO                12 to 18  88.62384 1.047954 1013818  86.56988  90.67779
##  UA                12 to 18  71.80438 1.055664 1013818  69.73531  73.87345
##  VX                12 to 18  64.32164 1.185139 1013818  61.99880  66.64447
##  WN                12 to 18  49.28704 1.033076 1013818  47.26225  51.31184
##  AA                18 to 24  77.06260 1.042952 1013818  75.01845  79.10675
##  AS                18 to 24  69.91783 1.142406 1013818  67.67876  72.15691
##  B6                18 to 24  79.74523 1.093314 1013818  77.60238  81.88809
##  DL                18 to 24  81.20942 1.039563 1013818  79.17191  83.24693
##  EV                18 to 24  97.61055 1.067146 1013818  95.51898  99.70212
##  F9                18 to 24  81.07129 1.155358 1013818  78.80683  83.33576
##  HA                18 to 24  68.13340 1.415601 1013818  65.35887  70.90793
##  NK                18 to 24  81.52761 1.123743 1013818  79.32511  83.73010
##  OO                18 to 24 101.31827 1.044845 1013818  99.27041 103.36613
##  UA                18 to 24  84.49881 1.050731 1013818  82.43941  86.55821
##  VX                18 to 24  77.01607 1.180387 1013818  74.70255  79.32959
##  WN                18 to 24  61.98147 1.027424 1013818  59.96776  63.99519
## 
## Results are averaged over the levels of: DAY_OF_WEEK, NORTHEAST, MIDWEST, SOUTH, WEST 
## Confidence level used: 0.95

We summarized the above results (stratified by time of day) into the table below: Table 5

As in the adjusted time-of-day means above, predicted delay times are shortest for departures between 6:00 and 12:00, and in that period the carrier with the shortest predicted delay is Southwest (WN). Predicted delay times are highest for departures between 0:00 and 6:00, followed by 18:00 to 24:00. Because delay2.lm has no carrier-by-time interaction term, the carrier ranking is identical in every time period, and the carrier with the longest predicted delay is always SkyWest (OO).

Since weather could impact delays, and weather patterns are often related to seasons, we also wish to stratify by season to see if there are any differences:

# stratification
# predictor: carrier
# stratified by season

# Spring
Spring <- df %>%
  filter (SPRING=="yes")

delay_spring.lm = lm(DEP_DELAY_NEW ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, data = Spring)

summary(delay_spring.lm)
## 
## Call:
## lm(formula = DEP_DELAY_NEW ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + 
##     DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, 
##     data = Spring)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -159.73  -38.78  -19.33   12.25 1757.68 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.302e+02  2.427e+00  53.657  < 2e-16 ***
## OP_UNIQUE_CARRIERAS  -7.917e+00  1.141e+00  -6.938 3.98e-12 ***
## OP_UNIQUE_CARRIERB6   5.910e+00  7.099e-01   8.325  < 2e-16 ***
## OP_UNIQUE_CARRIERDL   1.217e+01  5.778e-01  21.055  < 2e-16 ***
## OP_UNIQUE_CARRIEREV   2.379e+01  7.331e-01  32.446  < 2e-16 ***
## OP_UNIQUE_CARRIERF9   4.480e+00  1.257e+00   3.565 0.000363 ***
## OP_UNIQUE_CARRIERHA  -9.911e+00  2.086e+00  -4.750 2.03e-06 ***
## OP_UNIQUE_CARRIERNK   5.923e+00  9.751e-01   6.074 1.25e-09 ***
## OP_UNIQUE_CARRIEROO   2.271e+01  6.876e-01  33.033  < 2e-16 ***
## OP_UNIQUE_CARRIERUA   7.155e+00  6.660e-01  10.744  < 2e-16 ***
## OP_UNIQUE_CARRIERVX   3.306e+00  1.178e+00   2.806 0.005018 ** 
## OP_UNIQUE_CARRIERWN  -1.577e+01  5.327e-01 -29.600  < 2e-16 ***
## MONTH                 9.422e-01  1.916e-01   4.916 8.83e-07 ***
## factor(DAY_OF_WEEK)2 -4.050e+00  5.913e-01  -6.849 7.47e-12 ***
## factor(DAY_OF_WEEK)3 -2.461e+00  5.648e-01  -4.357 1.32e-05 ***
## factor(DAY_OF_WEEK)4  6.977e-01  5.524e-01   1.263 0.206626    
## factor(DAY_OF_WEEK)5 -6.016e-01  5.485e-01  -1.097 0.272667    
## factor(DAY_OF_WEEK)6 -3.240e+00  6.328e-01  -5.120 3.06e-07 ***
## factor(DAY_OF_WEEK)7 -2.411e+00  5.810e-01  -4.149 3.34e-05 ***
## DISTANCE             -1.835e-03  2.706e-04  -6.782 1.19e-11 ***
## hour_cat6 to 12      -8.800e+01  1.090e+00 -80.742  < 2e-16 ***
## hour_cat12 to 18     -8.536e+01  1.062e+00 -80.408  < 2e-16 ***
## hour_cat18 to 24     -7.232e+01  1.063e+00 -68.063  < 2e-16 ***
## NORTHEASTyes          1.779e+01  2.060e+00   8.636  < 2e-16 ***
## MIDWESTyes            1.277e+01  2.077e+00   6.150 7.78e-10 ***
## SOUTHyes              1.387e+01  2.045e+00   6.781 1.20e-11 ***
## WESTyes               7.427e+00  2.052e+00   3.619 0.000296 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.89 on 269489 degrees of freedom
## Multiple R-squared:  0.05893,    Adjusted R-squared:  0.05884 
## F-statistic: 649.1 on 26 and 269489 DF,  p-value: < 2.2e-16
lsmeans(delay_spring.lm, ~ OP_UNIQUE_CARRIER )
##  OP_UNIQUE_CARRIER    lsmean       SE     df  lower.CL  upper.CL
##  AA                 95.20257 2.126453 269489  91.03478  99.37036
##  AS                 87.28583 2.320574 269489  82.73757  91.83409
##  B6                101.11268 2.217500 269489  96.76644 105.45892
##  DL                107.36821 2.109018 269489 103.23459 111.50183
##  EV                118.98930 2.157419 269489 114.76082 123.21778
##  F9                 99.68291 2.375921 269489  95.02617 104.33965
##  HA                 85.29198 2.906236 269489  79.59583  90.98812
##  NK                101.12551 2.261746 269489  96.69255 105.55848
##  OO                117.91495 2.131479 269489 113.73731 122.09259
##  UA                102.35757 2.144668 269489  98.15408 106.56107
##  VX                 98.50869 2.342861 269489  93.91675 103.10064
##  WN                 79.43340 2.099894 269489  75.31766  83.54913
## 
## Results are averaged over the levels of: DAY_OF_WEEK, hour_cat, NORTHEAST, MIDWEST, SOUTH, WEST 
## Confidence level used: 0.95
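The season-specific fits in this part all reuse one model formula on different subsets. A more compact way to express the same idea, sketched here on simulated data with a reduced formula, is to build a single season factor from the month and fit one model per level with `split()` and `lapply()`. The names `toy`, `season_of` and `fits` are illustrative and do not appear in the analysis.

```r
set.seed(3)
n <- 1200

# Simulated data with a month and a made-up two-level carrier
toy <- data.frame(
  MONTH   = sample(1:12, n, replace = TRUE),
  carrier = factor(sample(c("X1", "X2"), n, replace = TRUE))
)
toy$delay <- 60 + 8 * (toy$carrier == "X2") + rnorm(n, sd = 20)

# Lookup from month number (1..12) to season, using the same month groups
# as the SPRING/SUMMER/FALL/WINTER indicators
season_of <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
               "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
toy$season <- factor(season_of[toy$MONTH],
                     levels = c("Spring", "Summer", "Fall", "Winter"))

# One model per season with the same formula
fits <- lapply(split(toy, toy$season), function(d) lm(delay ~ carrier, data = d))

# Compare one coefficient across the four seasonal fits
sapply(fits, function(f) coef(f)[["carrierX2"]])
```

This keeps the formula in one place, so a change to the predictors applies to every season at once.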
# Summer
Summer <- df %>%
  filter (SUMMER=="yes")

delay_summer.lm = lm(DEP_DELAY_NEW ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, data = Summer)

summary(delay_summer.lm)
## 
## Call:
## lm(formula = DEP_DELAY_NEW ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + 
##     DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, 
##     data = Summer)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -149.61  -37.26  -18.91   12.31 1845.96 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.412e+02  2.227e+00  63.407  < 2e-16 ***
## OP_UNIQUE_CARRIERAS  -1.128e+01  1.069e+00 -10.545  < 2e-16 ***
## OP_UNIQUE_CARRIERB6   3.339e+00  6.251e-01   5.342 9.21e-08 ***
## OP_UNIQUE_CARRIERDL  -4.640e+00  5.313e-01  -8.734  < 2e-16 ***
## OP_UNIQUE_CARRIEREV   1.756e+01  7.167e-01  24.508  < 2e-16 ***
## OP_UNIQUE_CARRIERF9   3.553e+00  1.030e+00   3.449 0.000563 ***
## OP_UNIQUE_CARRIERHA  -8.507e+00  2.271e+00  -3.746 0.000180 ***
## OP_UNIQUE_CARRIERNK   4.477e-01  8.934e-01   0.501 0.616299    
## OP_UNIQUE_CARRIEROO   2.330e+01  6.012e-01  38.756  < 2e-16 ***
## OP_UNIQUE_CARRIERUA   8.181e+00  5.775e-01  14.167  < 2e-16 ***
## OP_UNIQUE_CARRIERVX  -3.956e+00  1.206e+00  -3.279 0.001041 ** 
## OP_UNIQUE_CARRIERWN  -1.553e+01  4.532e-01 -34.256  < 2e-16 ***
## MONTH                -1.561e+00  1.721e-01  -9.072  < 2e-16 ***
## factor(DAY_OF_WEEK)2 -6.234e+00  5.203e-01 -11.983  < 2e-16 ***
## factor(DAY_OF_WEEK)3 -1.968e+00  5.180e-01  -3.798 0.000146 ***
## factor(DAY_OF_WEEK)4 -7.407e+00  4.910e-01 -15.085  < 2e-16 ***
## factor(DAY_OF_WEEK)5 -1.730e+00  4.876e-01  -3.547 0.000389 ***
## factor(DAY_OF_WEEK)6 -5.480e+00  5.448e-01 -10.058  < 2e-16 ***
## factor(DAY_OF_WEEK)7 -7.775e+00  5.276e-01 -14.737  < 2e-16 ***
## DISTANCE             -2.710e-04  2.462e-04  -1.101 0.271042    
## hour_cat6 to 12      -7.465e+01  8.799e-01 -84.845  < 2e-16 ***
## hour_cat12 to 18     -7.656e+01  8.514e-01 -89.920  < 2e-16 ***
## hour_cat18 to 24     -6.121e+01  8.488e-01 -72.118  < 2e-16 ***
## NORTHEASTyes          1.828e+01  1.690e+00  10.815  < 2e-16 ***
## MIDWESTyes            9.585e+00  1.703e+00   5.629 1.81e-08 ***
## SOUTHyes              1.115e+01  1.674e+00   6.662 2.71e-11 ***
## WESTyes               3.631e+00  1.681e+00   2.161 0.030730 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.05 on 322346 degrees of freedom
## Multiple R-squared:  0.05862,    Adjusted R-squared:  0.05854 
## F-statistic:   772 on 26 and 322346 DF,  p-value: < 2.2e-16
lsmeans(delay_summer.lm, ~ OP_UNIQUE_CARRIER )
##  OP_UNIQUE_CARRIER    lsmean       SE     df  lower.CL  upper.CL
##  AA                 93.91076 1.741913 322346  90.49666  97.32486
##  AS                 82.63575 1.966615 322346  78.78124  86.49026
##  B6                 97.24982 1.838126 322346  93.64714 100.85249
##  DL                 89.27028 1.741303 322346  85.85737  92.68318
##  EV                111.47517 1.804076 322346 107.93923 115.01111
##  F9                 97.46361 1.953448 322346  93.63491 101.29231
##  HA                 85.40358 2.809316 322346  79.89740  90.90976
##  NK                 94.35843 1.887368 322346  90.65924  98.05762
##  OO                117.21112 1.751853 322346 113.77754 120.64471
##  UA                102.09201 1.758640 322346  98.64513 105.53890
##  VX                 89.95459 2.046975 322346  85.94258  93.96660
##  WN                 78.38513 1.719244 322346  75.01546  81.75480
## 
## Results are averaged over the levels of: DAY_OF_WEEK, hour_cat, NORTHEAST, MIDWEST, SOUTH, WEST 
## Confidence level used: 0.95
# Fall
Fall <- df %>%
  filter (FALL=="yes")

delay_fall.lm = lm(DEP_DELAY_NEW ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, data = Fall)

summary(delay_fall.lm)
## 
## Call:
## lm(formula = DEP_DELAY_NEW ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + 
##     DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, 
##     data = Fall)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -131.32  -34.73  -19.23    9.04 1724.55 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.351e+02  3.668e+00  36.837  < 2e-16 ***
## OP_UNIQUE_CARRIERAS  -4.002e+00  1.255e+00  -3.188  0.00143 ** 
## OP_UNIQUE_CARRIERB6   1.282e+00  8.557e-01   1.498  0.13421    
## OP_UNIQUE_CARRIERDL  -1.960e-01  7.176e-01  -0.273  0.78474    
## OP_UNIQUE_CARRIEREV   2.041e+01  9.361e-01  21.800  < 2e-16 ***
## OP_UNIQUE_CARRIERF9   6.929e+00  1.281e+00   5.411 6.29e-08 ***
## OP_UNIQUE_CARRIERHA  -1.309e+01  2.243e+00  -5.839 5.27e-09 ***
## OP_UNIQUE_CARRIERNK   8.741e+00  1.208e+00   7.233 4.74e-13 ***
## OP_UNIQUE_CARRIEROO   2.527e+01  7.574e-01  33.363  < 2e-16 ***
## OP_UNIQUE_CARRIERUA   7.007e+00  7.627e-01   9.187  < 2e-16 ***
## OP_UNIQUE_CARRIERVX  -8.588e-02  1.417e+00  -0.061  0.95167    
## OP_UNIQUE_CARRIERWN  -1.440e+01  6.159e-01 -23.390  < 2e-16 ***
## MONTH                -1.111e+00  2.302e-01  -4.826 1.39e-06 ***
## factor(DAY_OF_WEEK)2 -1.873e+00  6.750e-01  -2.776  0.00551 ** 
## factor(DAY_OF_WEEK)3 -5.050e+00  6.682e-01  -7.558 4.10e-14 ***
## factor(DAY_OF_WEEK)4 -1.979e+00  6.326e-01  -3.129  0.00176 ** 
## factor(DAY_OF_WEEK)5 -1.648e+00  6.177e-01  -2.668  0.00763 ** 
## factor(DAY_OF_WEEK)6 -4.049e-01  7.489e-01  -0.541  0.58877    
## factor(DAY_OF_WEEK)7  1.091e+00  6.318e-01   1.727  0.08419 .  
## DISTANCE              5.553e-04  3.146e-04   1.765  0.07751 .  
## hour_cat6 to 12      -6.337e+01  1.452e+00 -43.650  < 2e-16 ***
## hour_cat12 to 18     -6.276e+01  1.427e+00 -43.993  < 2e-16 ***
## hour_cat18 to 24     -5.290e+01  1.431e+00 -36.974  < 2e-16 ***
## NORTHEASTyes         -2.641e+00  2.422e+00  -1.090  0.27552    
## MIDWESTyes           -4.301e+00  2.426e+00  -1.773  0.07625 .  
## SOUTHyes             -5.913e+00  2.397e+00  -2.467  0.01361 *  
## WESTyes              -5.770e+00  2.402e+00  -2.403  0.01628 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.51 on 180927 degrees of freedom
## Multiple R-squared:  0.04286,    Adjusted R-squared:  0.04273 
## F-statistic: 311.6 on 26 and 180927 DF,  p-value: < 2.2e-16
lsmeans(delay_fall.lm, ~ OP_UNIQUE_CARRIER )
##  OP_UNIQUE_CARRIER   lsmean       SE     df lower.CL upper.CL
##  AA                69.03222 2.482795 180927 64.16599 73.89844
##  AS                65.03069 2.673251 180927 59.79118 70.27020
##  B6                70.31379 2.574037 180927 65.26874 75.35885
##  DL                68.83619 2.486525 180927 63.96266 73.70972
##  EV                89.43838 2.545076 180927 84.45009 94.42667
##  F9                75.96079 2.698625 180927 70.67155 81.25003
##  HA                55.93765 3.263403 180927 49.54146 62.33385
##  NK                77.77343 2.682119 180927 72.51653 83.03032
##  OO                94.30103 2.478411 180927 89.44340 99.15866
##  UA                76.03957 2.489662 180927 71.15989 80.91925
##  VX                68.94633 2.754938 180927 63.54672 74.34595
##  WN                54.62746 2.455069 180927 49.81558 59.43934
## 
## Results are averaged over the levels of: DAY_OF_WEEK, hour_cat, NORTHEAST, MIDWEST, SOUTH, WEST 
## Confidence level used: 0.95
# release some memory
rm(Spring)
rm(Summer)
rm(Fall)
# Winter
Winter <- df %>%
  filter (WINTER=="yes")

delay_winter.lm = lm(DEP_DELAY_NEW ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, data = Winter)

summary(delay_winter.lm)
## 
## Call:
## lm(formula = DEP_DELAY_NEW ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + 
##     DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, 
##     data = Winter)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -143.82  -36.96  -19.53   10.63 2704.37 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.279e+02  2.540e+00  50.363  < 2e-16 ***
## OP_UNIQUE_CARRIERAS  -3.481e+00  1.145e+00  -3.041 0.002362 ** 
## OP_UNIQUE_CARRIERB6  -1.058e+00  7.911e-01  -1.337 0.181189    
## OP_UNIQUE_CARRIERDL   8.237e+00  6.526e-01  12.622  < 2e-16 ***
## OP_UNIQUE_CARRIEREV   2.209e+01  8.053e-01  27.430  < 2e-16 ***
## OP_UNIQUE_CARRIERF9   3.582e+00  1.194e+00   3.001 0.002689 ** 
## OP_UNIQUE_CARRIERHA  -6.199e+00  1.703e+00  -3.640 0.000273 ***
## OP_UNIQUE_CARRIERNK   5.400e+00  1.106e+00   4.880 1.06e-06 ***
## OP_UNIQUE_CARRIEROO   2.710e+01  7.120e-01  38.068  < 2e-16 ***
## OP_UNIQUE_CARRIERUA   8.104e+00  7.113e-01  11.393  < 2e-16 ***
## OP_UNIQUE_CARRIERVX   2.535e+00  1.412e+00   1.795 0.072614 .  
## OP_UNIQUE_CARRIERWN  -1.400e+01  5.792e-01 -24.166  < 2e-16 ***
## MONTH                -1.288e-01  3.333e-02  -3.863 0.000112 ***
## factor(DAY_OF_WEEK)2 -3.665e+00  6.137e-01  -5.972 2.35e-09 ***
## factor(DAY_OF_WEEK)3 -3.502e+00  6.239e-01  -5.613 1.99e-08 ***
## factor(DAY_OF_WEEK)4 -5.977e+00  6.007e-01  -9.951  < 2e-16 ***
## factor(DAY_OF_WEEK)5 -8.684e-01  5.838e-01  -1.488 0.136859    
## factor(DAY_OF_WEEK)6  4.928e+00  6.265e-01   7.866 3.69e-15 ***
## factor(DAY_OF_WEEK)7  6.164e+00  6.045e-01  10.198  < 2e-16 ***
## DISTANCE             -1.530e-03  2.892e-04  -5.290 1.22e-07 ***
## hour_cat6 to 12      -6.894e+01  1.214e+00 -56.787  < 2e-16 ***
## hour_cat12 to 18     -6.721e+01  1.191e+00 -56.447  < 2e-16 ***
## hour_cat18 to 24     -5.716e+01  1.196e+00 -47.781  < 2e-16 ***
## NORTHEASTyes          1.478e+00  2.229e+00   0.663 0.507421    
## MIDWESTyes           -6.304e-01  2.238e+00  -0.282 0.778186    
## SOUTHyes             -9.811e-01  2.208e+00  -0.444 0.656834    
## WESTyes              -3.180e+00  2.212e+00  -1.437 0.150583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.45 on 240975 degrees of freedom
## Multiple R-squared:  0.04512,    Adjusted R-squared:  0.04501 
## F-statistic: 437.9 on 26 and 240975 DF,  p-value: < 2.2e-16
lsmeans(delay_winter.lm, ~ OP_UNIQUE_CARRIER )
##  OP_UNIQUE_CARRIER    lsmean       SE     df lower.CL  upper.CL
##  AA                 75.47479 2.290322 240975 70.98582  79.96376
##  AS                 71.99368 2.457574 240975 67.17691  76.81046
##  B6                 74.41697 2.387680 240975 69.73718  79.09676
##  DL                 83.71220 2.278869 240975 79.24567  88.17872
##  EV                 97.56347 2.323690 240975 93.00910 102.11784
##  F9                 79.05717 2.479242 240975 74.19792  83.91642
##  HA                 69.27538 2.770017 240975 63.84622  74.70455
##  NK                 80.87487 2.457624 240975 76.05799  85.69175
##  OO                102.57875 2.287613 240975 98.09509 107.06241
##  UA                 83.57897 2.306490 240975 79.05831  88.09963
##  VX                 78.01017 2.595625 240975 72.92282  83.09753
##  WN                 61.47852 2.262378 240975 57.04432  65.91272
## 
## Results are averaged over the levels of: DAY_OF_WEEK, hour_cat, NORTHEAST, MIDWEST, SOUTH, WEST 
## Confidence level used: 0.95
# release some memory
rm(delay.lm)
rm(delay_day.lm)
rm(delay_fall.lm)
rm(delay_hour.lm)
## Warning in rm(delay_hour.lm): object 'delay_hour.lm' not found
rm(delay_spring.lm)
rm(delay_summer.lm)
rm(delay_winter.lm)
rm(delay2.lm)
rm(Winter)

We summarized the above results (stratified by season) into the table below: Table 6

Among the three seasons shown above, predicted delays are shortest in Fall for every carrier. Southwest (WN) has the shortest adjusted mean delay in Summer, Fall, and Winter, with Hawaiian (HA) close behind in Fall. SkyWest (OO) has the longest adjusted mean delay in all three seasons, followed by ExpressJet (EV). JetBlue (B6) is not among the two longest in any of them.
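
The adjusted carrier means reported by `lsmeans` can be reproduced by hand: predict the delay for every combination of the factor levels, then average the predictions within each carrier with equal weights. The following is a minimal sketch of that idea in base R, fitting one model per season in a single pass. It runs on simulated data; the data frame `toy`, its columns, the effect sizes, and the helper `adjusted_means` are all illustrative assumptions rather than part of the original analysis, and numeric covariates such as `DISTANCE` are left out for brevity.

```r
# Simulated stand-in for the flight data (assumed names and values, not BTS data).
set.seed(1)
n <- 2000
toy <- data.frame(
  carrier  = factor(sample(c("AA", "HA", "WN"), n, replace = TRUE)),
  season   = factor(sample(c("Spring", "Summer", "Fall", "Winter"), n, replace = TRUE)),
  hour_cat = factor(sample(c("0 to 6", "6 to 12", "12 to 18", "18 to 24"), n, replace = TRUE))
)
toy$delay <- 60 + 10 * (toy$carrier == "AA") - 5 * (toy$carrier == "WN") +
  15 * (toy$hour_cat == "0 to 6") + rnorm(n, sd = 20)

# Hypothetical helper: predict on every carrier x hour_cat combination,
# then average over hour_cat with equal weights for each carrier.
adjusted_means <- function(fit, data) {
  grid <- expand.grid(carrier  = levels(data$carrier),
                      hour_cat = levels(data$hour_cat))
  grid$pred <- predict(fit, newdata = grid)
  vapply(split(grid$pred, grid$carrier), mean, numeric(1))
}

# One linear model per season, fitted without copying the code four times.
fits <- lapply(split(toy, toy$season),
               function(d) lm(delay ~ carrier + hour_cat, data = d))

# Rows are carriers, columns are seasons.
means <- sapply(fits, adjusted_means, data = toy)
round(means, 1)
```

Looping over `split()` also avoids the manual `rm()` calls, since each seasonal subset exists only inside the anonymous function.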

Predict delay probability for each carrier (stratified by season)

Because delay times may contain extreme values, we also dichotomized delay into a binary variable (delayed 15 or more minutes vs. delayed less than 15 minutes or not delayed) and compared the carriers using logistic regression models:

# predictor: carrier (adjusted)
# Spring
Spring <- df %>%
  filter (SPRING=="yes")

logit_model <- glm(DEP_DEL15 ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, data=Spring, family = "binomial")
## Warning: glm.fit: algorithm did not converge
summary(logit_model)
## 
## Call:
## glm(formula = DEP_DEL15 ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + 
##     DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, 
##     family = "binomial", data = Spring)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## 2.409e-06  2.409e-06  2.409e-06  2.409e-06  2.409e-06  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## (Intercept)           2.657e+01  1.069e+04   0.002    0.998
## OP_UNIQUE_CARRIERAS  -3.866e-09  5.024e+03   0.000    1.000
## OP_UNIQUE_CARRIERB6  -1.126e-09  3.125e+03   0.000    1.000
## OP_UNIQUE_CARRIERDL  -1.738e-09  2.544e+03   0.000    1.000
## OP_UNIQUE_CARRIEREV   3.649e-09  3.228e+03   0.000    1.000
## OP_UNIQUE_CARRIERF9  -3.154e-09  5.532e+03   0.000    1.000
## OP_UNIQUE_CARRIERHA  -8.558e-09  9.186e+03   0.000    1.000
## OP_UNIQUE_CARRIERNK   1.272e-08  4.293e+03   0.000    1.000
## OP_UNIQUE_CARRIEROO  -6.548e-09  3.027e+03   0.000    1.000
## OP_UNIQUE_CARRIERUA  -3.160e-09  2.932e+03   0.000    1.000
## OP_UNIQUE_CARRIERVX  -4.949e-09  5.188e+03   0.000    1.000
## OP_UNIQUE_CARRIERWN  -4.270e-09  2.345e+03   0.000    1.000
## MONTH                -2.763e-09  8.438e+02   0.000    1.000
## factor(DAY_OF_WEEK)2  1.699e-09  2.603e+03   0.000    1.000
## factor(DAY_OF_WEEK)3  2.947e-09  2.487e+03   0.000    1.000
## factor(DAY_OF_WEEK)4 -2.583e-10  2.432e+03   0.000    1.000
## factor(DAY_OF_WEEK)5  2.419e-09  2.415e+03   0.000    1.000
## factor(DAY_OF_WEEK)6  2.439e-08  2.786e+03   0.000    1.000
## factor(DAY_OF_WEEK)7 -1.888e-09  2.558e+03   0.000    1.000
## DISTANCE              8.025e-13  1.191e+00   0.000    1.000
## hour_cat6 to 12      -8.322e-09  4.799e+03   0.000    1.000
## hour_cat12 to 18     -5.579e-09  4.674e+03   0.000    1.000
## hour_cat18 to 24     -5.768e-09  4.678e+03   0.000    1.000
## NORTHEASTyes         -9.432e-09  9.070e+03   0.000    1.000
## MIDWESTyes           -4.928e-09  9.143e+03   0.000    1.000
## SOUTHyes             -5.917e-09  9.003e+03   0.000    1.000
## WESTyes              -8.009e-09  9.036e+03   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 269515  degrees of freedom
## Residual deviance: 1.5636e-06  on 269489  degrees of freedom
## AIC: 54
## 
## Number of Fisher Scoring iterations: 25
exp(coef(logit_model))
##          (Intercept)  OP_UNIQUE_CARRIERAS  OP_UNIQUE_CARRIERB6 
##         344742669341                    1                    1 
##  OP_UNIQUE_CARRIERDL  OP_UNIQUE_CARRIEREV  OP_UNIQUE_CARRIERF9 
##                    1                    1                    1 
##  OP_UNIQUE_CARRIERHA  OP_UNIQUE_CARRIERNK  OP_UNIQUE_CARRIEROO 
##                    1                    1                    1 
##  OP_UNIQUE_CARRIERUA  OP_UNIQUE_CARRIERVX  OP_UNIQUE_CARRIERWN 
##                    1                    1                    1 
##                MONTH factor(DAY_OF_WEEK)2 factor(DAY_OF_WEEK)3 
##                    1                    1                    1 
## factor(DAY_OF_WEEK)4 factor(DAY_OF_WEEK)5 factor(DAY_OF_WEEK)6 
##                    1                    1                    1 
## factor(DAY_OF_WEEK)7             DISTANCE      hour_cat6 to 12 
##                    1                    1                    1 
##     hour_cat12 to 18     hour_cat18 to 24         NORTHEASTyes 
##                    1                    1                    1 
##           MIDWESTyes             SOUTHyes              WESTyes 
##                    1                    1                    1
# predictor: carrier (adjusted)
# Summer
rm(Spring)
Summer <- df %>%
  filter (SUMMER=="yes")

logit_model <- glm(DEP_DEL15 ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, data=Summer, family = "binomial")
## Warning: glm.fit: algorithm did not converge
summary(logit_model)
## 
## Call:
## glm(formula = DEP_DEL15 ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + 
##     DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, 
##     family = "binomial", data = Summer)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## 2.409e-06  2.409e-06  2.409e-06  2.409e-06  2.409e-06  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## (Intercept)           2.657e+01  1.003e+04   0.003    0.998
## OP_UNIQUE_CARRIERAS  -1.111e-08  4.817e+03   0.000    1.000
## OP_UNIQUE_CARRIERB6   1.562e-10  2.816e+03   0.000    1.000
## OP_UNIQUE_CARRIERDL   1.096e-09  2.393e+03   0.000    1.000
## OP_UNIQUE_CARRIEREV   6.147e-09  3.229e+03   0.000    1.000
## OP_UNIQUE_CARRIERF9   8.717e-10  4.641e+03   0.000    1.000
## OP_UNIQUE_CARRIERHA  -1.534e-08  1.023e+04   0.000    1.000
## OP_UNIQUE_CARRIERNK  -1.273e-10  4.024e+03   0.000    1.000
## OP_UNIQUE_CARRIEROO  -1.895e-08  2.708e+03   0.000    1.000
## OP_UNIQUE_CARRIERUA   2.784e-16  2.602e+03   0.000    1.000
## OP_UNIQUE_CARRIERVX  -1.130e-08  5.435e+03   0.000    1.000
## OP_UNIQUE_CARRIERWN  -4.093e-12  2.042e+03   0.000    1.000
## MONTH                 1.008e-08  7.753e+02   0.000    1.000
## factor(DAY_OF_WEEK)2  2.778e-08  2.344e+03   0.000    1.000
## factor(DAY_OF_WEEK)3  2.765e-08  2.334e+03   0.000    1.000
## factor(DAY_OF_WEEK)4  2.720e-08  2.212e+03   0.000    1.000
## factor(DAY_OF_WEEK)5  2.745e-08  2.197e+03   0.000    1.000
## factor(DAY_OF_WEEK)6  2.691e-08  2.454e+03   0.000    1.000
## factor(DAY_OF_WEEK)7  2.784e-08  2.377e+03   0.000    1.000
## DISTANCE              1.305e-12  1.109e+00   0.000    1.000
## hour_cat6 to 12       4.340e-09  3.964e+03   0.000    1.000
## hour_cat12 to 18     -1.283e-08  3.835e+03   0.000    1.000
## hour_cat18 to 24      2.094e-09  3.824e+03   0.000    1.000
## NORTHEASTyes          3.556e-10  7.615e+03   0.000    1.000
## MIDWESTyes           -6.186e-09  7.670e+03   0.000    1.000
## SOUTHyes              8.440e-10  7.543e+03   0.000    1.000
## WESTyes               1.235e-08  7.572e+03   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 322372  degrees of freedom
## Residual deviance: 1.8703e-06  on 322346  degrees of freedom
## AIC: 54
## 
## Number of Fisher Scoring iterations: 25
exp(coef(logit_model))
##          (Intercept)  OP_UNIQUE_CARRIERAS  OP_UNIQUE_CARRIERB6 
##         344742560264                    1                    1 
##  OP_UNIQUE_CARRIERDL  OP_UNIQUE_CARRIEREV  OP_UNIQUE_CARRIERF9 
##                    1                    1                    1 
##  OP_UNIQUE_CARRIERHA  OP_UNIQUE_CARRIERNK  OP_UNIQUE_CARRIEROO 
##                    1                    1                    1 
##  OP_UNIQUE_CARRIERUA  OP_UNIQUE_CARRIERVX  OP_UNIQUE_CARRIERWN 
##                    1                    1                    1 
##                MONTH factor(DAY_OF_WEEK)2 factor(DAY_OF_WEEK)3 
##                    1                    1                    1 
## factor(DAY_OF_WEEK)4 factor(DAY_OF_WEEK)5 factor(DAY_OF_WEEK)6 
##                    1                    1                    1 
## factor(DAY_OF_WEEK)7             DISTANCE      hour_cat6 to 12 
##                    1                    1                    1 
##     hour_cat12 to 18     hour_cat18 to 24         NORTHEASTyes 
##                    1                    1                    1 
##           MIDWESTyes             SOUTHyes              WESTyes 
##                    1                    1                    1
# predictor: carrier (adjusted)
# Fall
rm(Summer)
Fall <- df %>%
  filter (FALL=="yes")

logit_model <- glm(DEP_DEL15 ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, data=Fall, family = "binomial")
## Warning: glm.fit: algorithm did not converge
summary(logit_model)
## 
## Call:
## glm(formula = DEP_DEL15 ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + 
##     DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, 
##     family = "binomial", data = Fall)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## 2.409e-06  2.409e-06  2.409e-06  2.409e-06  2.409e-06  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## (Intercept)           2.657e+01  1.707e+04   0.002    0.999
## OP_UNIQUE_CARRIERAS  -3.192e-09  5.843e+03   0.000    1.000
## OP_UNIQUE_CARRIERB6   2.512e-09  3.983e+03   0.000    1.000
## OP_UNIQUE_CARRIERDL  -4.367e-08  3.340e+03   0.000    1.000
## OP_UNIQUE_CARRIEREV   2.736e-09  4.357e+03   0.000    1.000
## OP_UNIQUE_CARRIERF9   5.362e-09  5.960e+03   0.000    1.000
## OP_UNIQUE_CARRIERHA  -1.147e-08  1.044e+04   0.000    1.000
## OP_UNIQUE_CARRIERNK   2.481e-09  5.625e+03   0.000    1.000
## OP_UNIQUE_CARRIEROO   5.777e-09  3.525e+03   0.000    1.000
## OP_UNIQUE_CARRIERUA   3.993e-10  3.550e+03   0.000    1.000
## OP_UNIQUE_CARRIERVX  -2.829e-09  6.595e+03   0.000    1.000
## OP_UNIQUE_CARRIERWN   4.467e-09  2.867e+03   0.000    1.000
## MONTH                 7.707e-09  1.071e+03   0.000    1.000
## factor(DAY_OF_WEEK)2  3.043e-08  3.142e+03   0.000    1.000
## factor(DAY_OF_WEEK)3  2.877e-08  3.110e+03   0.000    1.000
## factor(DAY_OF_WEEK)4  2.941e-08  2.945e+03   0.000    1.000
## factor(DAY_OF_WEEK)5  2.994e-08  2.875e+03   0.000    1.000
## factor(DAY_OF_WEEK)6  2.921e-08  3.486e+03   0.000    1.000
## factor(DAY_OF_WEEK)7  3.004e-08  2.941e+03   0.000    1.000
## DISTANCE              1.222e-12  1.464e+00   0.000    1.000
## hour_cat6 to 12       2.953e-09  6.758e+03   0.000    1.000
## hour_cat12 to 18      3.353e-09  6.640e+03   0.000    1.000
## hour_cat18 to 24     -2.624e-08  6.659e+03   0.000    1.000
## NORTHEASTyes         -7.395e-09  1.127e+04   0.000    1.000
## MIDWESTyes           -7.153e-09  1.129e+04   0.000    1.000
## SOUTHyes             -2.639e-08  1.115e+04   0.000    1.000
## WESTyes              -8.377e-09  1.118e+04   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 180953  degrees of freedom
## Residual deviance: 1.0498e-06  on 180927  degrees of freedom
## AIC: 54
## 
## Number of Fisher Scoring iterations: 25
exp(coef(logit_model))
##          (Intercept)  OP_UNIQUE_CARRIERAS  OP_UNIQUE_CARRIERB6 
##         344742568285                    1                    1 
##  OP_UNIQUE_CARRIERDL  OP_UNIQUE_CARRIEREV  OP_UNIQUE_CARRIERF9 
##                    1                    1                    1 
##  OP_UNIQUE_CARRIERHA  OP_UNIQUE_CARRIERNK  OP_UNIQUE_CARRIEROO 
##                    1                    1                    1 
##  OP_UNIQUE_CARRIERUA  OP_UNIQUE_CARRIERVX  OP_UNIQUE_CARRIERWN 
##                    1                    1                    1 
##                MONTH factor(DAY_OF_WEEK)2 factor(DAY_OF_WEEK)3 
##                    1                    1                    1 
## factor(DAY_OF_WEEK)4 factor(DAY_OF_WEEK)5 factor(DAY_OF_WEEK)6 
##                    1                    1                    1 
## factor(DAY_OF_WEEK)7             DISTANCE      hour_cat6 to 12 
##                    1                    1                    1 
##     hour_cat12 to 18     hour_cat18 to 24         NORTHEASTyes 
##                    1                    1                    1 
##           MIDWESTyes             SOUTHyes              WESTyes 
##                    1                    1                    1
# predictor: carrier (adjusted)
# Winter
rm(Fall)
Winter <- df %>%
  filter (WINTER=="yes")

logit_model <- glm(DEP_DEL15 ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, data=Winter, family = "binomial")
## Warning: glm.fit: algorithm did not converge
summary(logit_model)
## 
## Call:
## glm(formula = DEP_DEL15 ~ OP_UNIQUE_CARRIER + MONTH + factor(DAY_OF_WEEK) + 
##     DISTANCE + hour_cat + NORTHEAST + MIDWEST + SOUTH + WEST, 
##     family = "binomial", data = Winter)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## 2.409e-06  2.409e-06  2.409e-06  2.409e-06  2.409e-06  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## (Intercept)           2.657e+01  1.110e+04   0.002    0.998
## OP_UNIQUE_CARRIERAS   2.619e-06  5.006e+03   0.000    1.000
## OP_UNIQUE_CARRIERB6   2.754e-06  3.459e+03   0.000    1.000
## OP_UNIQUE_CARRIERDL   2.587e-06  2.853e+03   0.000    1.000
## OP_UNIQUE_CARRIEREV   2.675e-06  3.521e+03   0.000    1.000
## OP_UNIQUE_CARRIERF9   3.039e-06  5.219e+03   0.000    1.000
## OP_UNIQUE_CARRIERHA   2.089e-06  7.447e+03   0.000    1.000
## OP_UNIQUE_CARRIERNK   2.808e-06  4.838e+03   0.000    1.000
## OP_UNIQUE_CARRIEROO   2.876e-06  3.113e+03   0.000    1.000
## OP_UNIQUE_CARRIERUA   2.727e-06  3.110e+03   0.000    1.000
## OP_UNIQUE_CARRIERVX   2.302e-06  6.175e+03   0.000    1.000
## OP_UNIQUE_CARRIERWN   2.897e-06  2.532e+03   0.000    1.000
## MONTH                 7.692e-09  1.457e+02   0.000    1.000
## factor(DAY_OF_WEEK)2 -7.604e-08  2.683e+03   0.000    1.000
## factor(DAY_OF_WEEK)3 -7.240e-08  2.728e+03   0.000    1.000
## factor(DAY_OF_WEEK)4 -3.366e-08  2.626e+03   0.000    1.000
## factor(DAY_OF_WEEK)5 -5.111e-08  2.552e+03   0.000    1.000
## factor(DAY_OF_WEEK)6 -5.306e-08  2.739e+03   0.000    1.000
## factor(DAY_OF_WEEK)7 -3.124e-06  2.643e+03   0.000    1.000
## DISTANCE             -7.319e-10  1.264e+00   0.000    1.000
## hour_cat6 to 12      -7.447e-07  5.308e+03   0.000    1.000
## hour_cat12 to 18     -2.055e-07  5.206e+03   0.000    1.000
## hour_cat18 to 24     -2.097e-07  5.230e+03   0.000    1.000
## NORTHEASTyes         -3.110e-06  9.747e+03   0.000    1.000
## MIDWESTyes           -2.841e-07  9.785e+03   0.000    1.000
## SOUTHyes             -4.140e-07  9.655e+03   0.000    1.000
## WESTyes              -3.351e-07  9.673e+03   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 241001  degrees of freedom
## Residual deviance: 1.3982e-06  on 240975  degrees of freedom
## AIC: 54
## 
## Number of Fisher Scoring iterations: 25
exp(coef(logit_model))
##          (Intercept)  OP_UNIQUE_CARRIERAS  OP_UNIQUE_CARRIERB6 
##         3.447435e+11         1.000003e+00         1.000003e+00 
##  OP_UNIQUE_CARRIERDL  OP_UNIQUE_CARRIEREV  OP_UNIQUE_CARRIERF9 
##         1.000003e+00         1.000003e+00         1.000003e+00 
##  OP_UNIQUE_CARRIERHA  OP_UNIQUE_CARRIERNK  OP_UNIQUE_CARRIEROO 
##         1.000002e+00         1.000003e+00         1.000003e+00 
##  OP_UNIQUE_CARRIERUA  OP_UNIQUE_CARRIERVX  OP_UNIQUE_CARRIERWN 
##         1.000003e+00         1.000002e+00         1.000003e+00 
##                MONTH factor(DAY_OF_WEEK)2 factor(DAY_OF_WEEK)3 
##         1.000000e+00         9.999999e-01         9.999999e-01 
## factor(DAY_OF_WEEK)4 factor(DAY_OF_WEEK)5 factor(DAY_OF_WEEK)6 
##         1.000000e+00         9.999999e-01         9.999999e-01 
## factor(DAY_OF_WEEK)7             DISTANCE      hour_cat6 to 12 
##         9.999969e-01         1.000000e+00         9.999993e-01 
##     hour_cat12 to 18     hour_cat18 to 24         NORTHEASTyes 
##         9.999998e-01         9.999998e-01         9.999969e-01 
##           MIDWESTyes             SOUTHyes              WESTyes 
##         9.999997e-01         9.999996e-01         9.999997e-01
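
The non-convergence warnings, the null deviance of 0, and the odds ratios of essentially 1 in the four fits above are the pattern `glm` produces when the outcome takes a single value in the data passed to it, which would happen if `df` had already been restricted to delayed flights. Whether that is the case here is an assumption. The sketch below uses simulated data with hypothetical column names (`dep_delay`, `delayed15`, `carrier`, `hour_cat`) and a hypothetical helper, `fit_delay_logit`, to show a check that would catch the problem before fitting.

```r
# Hypothetical helper: refuse to fit when the outcome has only one value.
fit_delay_logit <- function(data, outcome = "delayed15") {
  counts <- table(data[[outcome]])
  if (length(counts) < 2) {
    stop("outcome '", outcome, "' is constant in this subset; nothing to model")
  }
  glm(reformulate(c("carrier", "hour_cat"), response = outcome),
      data = data, family = binomial)
}

# Simulated flights (assumed values): WN is delayed less often by construction.
set.seed(2)
n <- 3000
toy <- data.frame(
  carrier  = factor(sample(c("AA", "HA", "WN"), n, replace = TRUE)),
  hour_cat = factor(sample(c("6 to 12", "12 to 18", "18 to 24"), n, replace = TRUE))
)
toy$dep_delay <- rexp(n, rate = ifelse(toy$carrier == "WN", 1 / 8, 1 / 15))
toy$delayed15 <- as.integer(toy$dep_delay >= 15)

# Both outcome values are present, so the model fits normally.
table(toy$delayed15)
fit <- fit_delay_logit(toy)

# Keeping only delayed flights makes the outcome constant; the helper stops.
late_only <- subset(toy, delayed15 == 1)
res <- tryCatch(fit_delay_logit(late_only),
                error = function(e) conditionMessage(e))
res

# Without the check, glm runs to its iteration limit and returns a very
# large intercept with uninformative slopes.
bad <- suppressWarnings(
  glm(delayed15 ~ carrier + hour_cat, data = late_only, family = binomial)
)
coef(bad)["(Intercept)"]
```

If the outcome does turn out to be constant, the models would need to be refitted on data that includes on-time flights before the carrier odds ratios can be interpreted.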

We summarized the results (stratified by season) for logistic regressions into the table below: Table 7

According to Table 7, with American Airlines as the reference carrier, Hawaiian Airlines has the lowest odds of a 15+ minute delay in every season, and the highest odds belong to either JetBlue or Virgin America, depending on the season.
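
Odds ratios relative to a reference carrier, together with confidence intervals, can be read off a fitted logistic model by exponentiating the coefficients and their interval bounds. The sketch below illustrates this on simulated data; the carrier delay probabilities are made-up values chosen for illustration, the column names are assumptions, and the Wald intervals from `confint.default` are one reasonable choice rather than the method used for Table 7.

```r
# Simulated data with assumed delay probabilities per carrier (not real rates).
set.seed(3)
n <- 4000
toy <- data.frame(
  carrier = factor(sample(c("AA", "B6", "HA"), n, replace = TRUE))
)
p_delay <- c(AA = 0.20, B6 = 0.30, HA = 0.10)[as.character(toy$carrier)]
toy$delayed15 <- rbinom(n, 1, p_delay)

# Make American Airlines (AA) the reference level explicitly.
toy$carrier <- relevel(toy$carrier, ref = "AA")

fit <- glm(delayed15 ~ carrier, data = toy, family = binomial)

# Odds ratios and 95% Wald confidence intervals on the odds scale.
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))

# Drop the intercept row: it is the baseline odds for AA, not an odds ratio.
round(or_table[-1, ], 2)
```

An odds ratio below 1 means lower odds of a 15+ minute delay than the reference carrier, and an interval that excludes 1 indicates a difference at the 5% level.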